Abstract

In this paper we derive an efficient message-passing algorithm to learn the parameters
of structured predictors in general graphical models. We define the extended log-loss, which
relates the log-loss of CRFs and the hinge-loss of structured SVMs through a temperature
parameter. We then investigate the primal and dual properties of the extended log-loss,
showing that the dual programs of both CRFs and structured SVMs perform moment
matching using different selection rules. Utilizing the graphical models of the predictors,
we describe a low dimensional extended log-loss formulation, which is derived from pseudo
moment matching and a fractional entropy approximation selection rule in the dual setting.
We propose an efficient message-passing algorithm that is guaranteed to converge
to the optimum of the low dimensional primal and dual programs. Unlike many of the
existing approaches, this allows us to efficiently learn high-order graphical models over
regions of any size and with a very large number of parameters. We demonstrate the
effectiveness of our approach, presenting state-of-the-art results in stereo estimation,
semantic segmentation, shape reconstruction, and indoor scene understanding.



1. Introduction 

Structured prediction is an effective framework to reason about real-life problems since it
provides the means to map objects x to labels y. Typically, the label space has rich
internal structure, e.g., semantic segmentations or depth estimations, and the set of possible
labels for a given object is exponential in its size. Ideally, one would want to
make joint predictions on the structured labels instead of simply predicting each element
independently, as this additionally accounts for the statistical correlations between label
elements, as well as between training objects and their labels. These properties make
structured prediction appealing for a wide range of applications in computer vision
(Felzenszwalb et al. (2010); Szeliski et al. (2007)) as well as in natural language
processing (Koo et al. (2010)) and computational biology (Yanover et al. (2007);
Sontag et al. (2008)).



Learning the parameters of structured predictors greatly influences the prediction
accuracy. Several models have been recently proposed, including log-likelihood models such
as conditional random fields (CRFs, Lafferty et al. (2001)), and structured support vector
machines (structured SVMs) such as maximum-margin Markov networks (M3Ns, Taskar
et al. (2004)) and structured output learning (Tsochantaridis et al. (2004)). For CRFs,
parameter estimation is done by minimizing a convex function composed of a negative
log-likelihood loss and a regularization term. Learning the parameters with structured SVMs
is done by minimizing the convex regularized structured hinge loss.

Despite the convexity of the objective functions, finding the optimal parameters of
these models can be computationally expensive since it involves comparing the training
labels with the predicted labels, which are computed from the space of exponentially many
possible labels. When the label structure corresponds to a tree, learning can be done
efficiently by using belief propagation as a subroutine: the sum-product algorithm is typically
used in CRFs and the max-product algorithm in structured SVMs. In general, when the
label structure corresponds to a general graph, one can compute neither the objective nor the
gradient exactly, except for some special cases in structured SVMs, such as matching and
sub-modular functions (e.g., Taskar et al. (2006)). Therefore, one usually resorts to
approximate inference algorithms (cf. Finley and Joachims (2008); Levin and Weiss (2006)).
However, approximate inference algorithms are too computationally expensive to be used
as a subroutine of the learning algorithm, and therefore they cannot be applied efficiently to
learn the parameters of structured predictors.

In this paper we derive an efficient message-passing algorithm to learn the parameters 
of structured predictors in general graphical models. First, we define the extended log- 
loss, which relates the log-loss of CRFs and the hinge-loss of structured SVMs through 
a temperature parameter. As a consequence we show that CRFs smoothly approximate 
structured SVMs in low temperatures. We then investigate the primal and dual properties 
of the extended log-loss, showing that the dual programs of both CRFs and structured SVMs 
perform moment matching using different selection rules. Utilizing the graphical models of 
the predictors, we describe a low dimensional extended log-loss formulation, which is derived 
from pseudo moment matching and a fractional entropy approximation selection rule in the 
dual setting. Next we propose an efficient message-passing algorithm that is guaranteed to 
converge to the optimum of the low dimensional primal and dual programs. Unlike many of
the existing approaches, this allows us to efficiently learn high-order graphical models over
regions of any size and with a very large number of parameters. We demonstrate the effectiveness
of our approach, while presenting state-of-the-art results in stereo estimation, semantic 
segmentation, shape reconstruction, and indoor scene understanding. 

The rest of the paper is organized as follows. In Section 2 we review parameter learning
methods, focusing on the most common models, CRFs and structured SVMs. We present
the extended log-loss which relates CRFs and structured SVMs in Section 2.1 and describe
the necessary background about graphical models and approximate inference in Section 2.2.
Next, in Section 3 we formulate the Fenchel duality theorem we are using throughout our
work, and derive the dual of the extended log-loss program, while emphasizing its moment
matching properties. In Section 4 we describe the pseudo moment matching formulation and
its primal low dimensional extended log-loss, and derive an efficient message-passing
algorithm for general graphical models. We present some theoretical extensions in Section 5.
We demonstrate the effectiveness of our approach in Section 6, describing our state-of-the-art
results in stereo estimation, semantic segmentation, shape reconstruction, and indoor scene
understanding.

2. Background 

Structured prediction typically involves objects $x \in \mathcal{X}$ and their labels $y \in \mathcal{Y}$. The structure
is usually incorporated into the labels, which may be sequences, trees, grids, or other
high-dimensional objects with internal structure. For every object $x$, its possible labels are
described by a feature function $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$. Our goal is to learn the parameters of
the linear prediction rule

$$y_w(x) = \arg\max_{\hat{y} \in \mathcal{Y}} w^\top \phi(x, \hat{y})$$

with parameters $w \in \mathbb{R}^d$, such that $y_w(x)$ is a good approximation to the true label of $x$.
Intuitively one would like to learn the parameters of structured predictors by minimizing the
training loss $\ell(y, y_w(x))$ incurred by using $w$ to predict the label of $x$, given that the true label
is $y$. Since the prediction is norm-insensitive this method can lead to overfitting. Therefore,
given a training set $(x, y) \in S$, the parameters $w$ are usually learned by minimizing a
norm-dependent loss

$$\sum_{(x,y) \in S} \bar{\ell}(w, x, y) + \frac{C}{2} \|w\|_2^2. \qquad (1)$$

The function $\bar{\ell}(w, x, y)$ is a surrogate of the true loss $\ell(y, y_w(x))$. The surrogate loss function
determines the learning setting for the prediction problem, e.g., structured SVMs and CRFs.
Structured SVMs aim at minimizing the surrogate hinge loss, presented by Taskar et al.
(2004); Tsochantaridis et al. (2006):

$$\bar{\ell}_{hinge}(w, x, y) = \max_{\hat{y} \in \mathcal{Y}} \left\{ \ell(y, \hat{y}) + w^\top \phi(x, \hat{y}) - w^\top \phi(x, y) \right\}.$$

The structured hinge loss upper bounds the true loss function. It corresponds to a
maximum-margin approach that linearly penalizes predictions $y_w(x)$ that violate a training pair
$(x, y) \in S$ by more than $\ell(y, y_w(x))$, i.e., $w^\top \phi(x, y) < \ell(y, y_w(x)) + w^\top \phi(x, y_w(x))$.

The second loss function that we consider is based on log-linear models, and is commonly
used in CRFs, defined by Lafferty et al. (2001). For every training example we define a
(conditional) Gibbs distribution

$$p_{(x,y)}(\hat{y}; w) \propto \exp\left( \ell(y, \hat{y}) + w^\top \phi(x, \hat{y}) \right). \qquad (2)$$

The Gibbs distribution provides a probabilistic prediction rule, which scales the different
predictions according to their prediction value. The surrogate loss function is then the
negative log-likelihood under the parameters $w$

$$\bar{\ell}_{log}(w, x, y) = -\log p_{(x,y)}(y; w).$$

In structured SVMs and CRFs a convex loss function and a convex regularization are 
minimized, and gradient based methods can recover their optimal solutions. 
2.1 One parameter extension of CRFs and Structured SVMs

In CRFs one aims to minimize the regularized negative log-likelihood of the distribution
$p_{(x,y)}(\hat{y}; w)$. The regularized log-loss is a convex and smooth function and its gradient
measures the disagreements between the objects' predicted labels and their training labels

$$\frac{\partial \bar{\ell}_{log}(w, x, y)}{\partial w_k} = \sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}; w) \phi_k(x, \hat{y}) - \phi_k(x, y).$$

The computational complexity of CRFs is governed by the gradient computation.

Structured SVMs aim at minimizing the regularized hinge loss $\bar{\ell}_{hinge}(w, x, y)$. The
hinge loss involves the max-function, which is a convex and non-smooth function. However,
every convex function has subgradients, i.e., supporting hyperplanes to its epigraph (cf.
Rockafellar (1970)). The subgradients generalize the concept of the gradient since a convex
function is smooth if and only if it has a single subgradient, namely its gradient. Danskin's
theorem (e.g., Bertsekas et al. (2003), Theorem 4.5.1) states that the subgradients of the
max-function correspond to probability distributions $p_{(x,y)}(\hat{y}; w)$ over the optimal set
$\mathcal{Y}^* = \arg\max_{\hat{y} \in \mathcal{Y}} \{ \ell(y, \hat{y}) + w^\top \phi(x, \hat{y}) \}$. Therefore the subdifferential of the hinge-loss takes the
form

$$\frac{\partial \bar{\ell}_{hinge}(w, x, y)}{\partial w_k} = \sum_{\hat{y} \in \mathcal{Y}^*} p_{(x,y)}(\hat{y}; w) \phi_k(x, \hat{y}) - \phi_k(x, y).$$

Unlike the smooth case, a subgradient does not necessarily point towards a direction of
descent. Thus subgradient methods are not monotonically decreasing, and their optimal
solution is recovered from the sequence of iterates the algorithm produces.

It is convenient to deal with both learning tasks for structured predictors (i.e.,
structured SVMs and CRFs) as two instances of the same framework. We follow the path of
Pletscher et al. (2010); Hazan and Urtasun (2010), and introduce a temperature parameter
to our probability model, namely $p_{(x,y)}(\hat{y}; w/\epsilon)$. This parameter controls the variance of the
probability distribution: it tends towards the uniform distribution when $\epsilon \to \infty$, and to the
zero-one distribution when $\epsilon \to 0$. We introduce a temperature extension of the log-loss
function

$$\bar{\ell}_{\epsilon\text{-}log}(w, x, y) \stackrel{def}{=} -\epsilon \log p_{(x,y)}(y; w/\epsilon).$$

The extended log-loss generalizes the hinge-loss and the log-loss in the same way the norm
function $\|\cdot\|_p$ generalizes the sum-function $\|\cdot\|_1$ and the max-function $\|\cdot\|_\infty$. In particular,
for $\epsilon = 1$ the extended log-loss reduces to the log-loss and for $\epsilon = 0$ it reduces to the
hinge-loss. Moreover, when $\epsilon \to 0$ the extended log-loss smoothly approximates the hinge-loss, in
the same way the $\ell_{1/\epsilon}$-norm is a smooth approximation of the $\ell_\infty$-norm.
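The interpolation between the two losses can be checked numerically on a tiny label space. The sketch below is illustrative only: it assumes that the temperature-scaled Gibbs distribution is proportional to $\exp((\ell(y,\hat{y}) + w^\top\phi(x,\hat{y}))/\epsilon)$, and the function name `extended_log_loss` and the numbers in `scores` and `losses` are hypothetical.

```python
import math

def extended_log_loss(scores, losses, true_idx, eps):
    """Return -eps * log p(y; w/eps) for one training pair.

    scores[j] plays the role of w^T phi(x, y_j) and losses[j] of l(y, y_j).
    Assumption: p is proportional to exp((loss + score) / eps).
    """
    theta = [l + s for l, s in zip(losses, scores)]
    if eps == 0.0:
        # Zero-temperature limit: the max replaces the soft-max (hinge loss).
        return max(theta) - theta[true_idx]
    m = max(theta)  # shift by the max for numerical stability
    soft_max = m + eps * math.log(sum(math.exp((t - m) / eps) for t in theta))
    return soft_max - theta[true_idx]

scores = [1.0, 0.5, -0.2]   # hypothetical values of w^T phi(x, y_j)
losses = [0.0, 1.0, 1.0]    # task loss, zero at the true label (index 0)
hinge = extended_log_loss(scores, losses, 0, 0.0)
log_loss = extended_log_loss(scores, losses, 0, 1.0)
near_hinge = extended_log_loss(scores, losses, 0, 0.01)
```

Under these assumptions the value at a small temperature lies between the hinge-loss and the hinge-loss plus $\epsilon \log |\mathcal{Y}|$, which is the sense in which the extended log-loss approximates the hinge-loss from above.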

The gradient of the one-parameter extension of CRFs and structured SVMs is
characterized by the gradient of the extended log-loss

$$\frac{\partial \bar{\ell}_{\epsilon\text{-}log}(w, x, y)}{\partial w_k} = \sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}; w/\epsilon) \phi_k(x, \hat{y}) - \phi_k(x, y), \qquad (3)$$

where $p_{(x,y)}(\hat{y}; w/\epsilon)$ is the Gibbs distribution over the possible labels $\hat{y} \in \mathcal{Y}$. When $\epsilon \to 0$ this
probability distribution gets concentrated around its maximal values, since all its elements
are raised to the power of a very large number (i.e., $1/\epsilon$). For $\epsilon = 0$ this distribution is
supported on the maximal elements $\mathcal{Y}^*$, and we attain a structured SVM subgradient.
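The gradient in equation (3) is an expected feature vector under the Gibbs distribution minus the feature vector of the training label. A minimal sketch of this computation by brute-force enumeration follows; it assumes the temperature-scaled distribution is proportional to $\exp((\ell + w^\top\phi)/\epsilon)$, and the helper names and the toy feature table are hypothetical.

```python
import math

def gibbs(theta, eps):
    """Softmax of theta / eps, computed stably."""
    m = max(theta)
    e = [math.exp((t - m) / eps) for t in theta]
    z = sum(e)
    return [v / z for v in e]

def loss_and_grad(w, feats, losses, true_idx, eps):
    """Extended log-loss and its gradient for one example.

    feats[j] is the feature vector phi(x, y_j); losses[j] is l(y, y_j).
    """
    theta = [losses[j] + sum(wk * fk for wk, fk in zip(w, feats[j]))
             for j in range(len(feats))]
    p = gibbs(theta, eps)
    m = max(theta)
    value = (m + eps * math.log(sum(math.exp((t - m) / eps) for t in theta))
             - theta[true_idx])
    # Expected features under p minus the features of the true label.
    grad = [sum(p[j] * feats[j][k] for j in range(len(feats)))
            - feats[true_idx][k] for k in range(len(w))]
    return value, grad

feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # toy phi(x, y_j), j = 0, 1, 2
losses = [0.0, 1.0, 1.0]                      # zero loss at the true label
w0 = [0.3, -0.2]
eps = 0.5
value, grad = loss_and_grad(w0, feats, losses, 0, eps)
```

Enumerating all labels is only feasible for toy problems; the remainder of the paper is concerned with avoiding exactly this enumeration.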
2.2 Structured Prediction in Graphical Models

In many real-life problems the labels $y \in \mathcal{Y}$ are n-tuples, $y = (y_1, ..., y_n)$, hence there are
exponentially many labels in $\mathcal{Y}$. The features usually describe relations between subsets
of elements $r \subset \{1, ..., n\}$, also called regions. We denote by $\mathcal{R}_k$ the regions of the feature
$\phi_k(x, y)$. The features are functions of their regions' labels $y_r \subset \{y_1, ..., y_n\}$:

$$\phi_k(x, y_1, ..., y_n) = \sum_{r \in \mathcal{R}_k} \phi_{k,r}(x, y_r). \qquad (4)$$

Similarly, we consider region-based loss functions $\ell(y, \hat{y}) = \sum_{r \in \mathcal{R}_\ell} \ell_r(y_r, \hat{y}_r)$. The loss
function, as well as the features, define hypergraphs whose nodes represent the $n$ label indexes,
and the regions $\mathcal{R} = \cup_k \mathcal{R}_k \cup \mathcal{R}_\ell$ correspond to its hyperedges. A convenient way to
represent a hypergraph is by its region graph. A region graph is a directed graph whose nodes
represent the regions and whose directed edges correspond to the inclusion relation, i.e., a directed
edge from node $r$ to $s$ is possible only if $s \subset r$. We adopt the terminology where $P(r)$ and
$C(r)$ stand for all nodes that are parents and children of the node $r$, respectively.
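As a concrete illustration of the parent and child sets, the following sketch builds a region graph from a list of regions. It makes one simplifying assumption that the text does not require: every strict inclusion between two listed regions becomes an edge. The function name `build_region_graph` is hypothetical.

```python
def build_region_graph(regions):
    """Return (parents, children) maps for a list of regions.

    Each region is a set of label indexes. Assumption: an edge r -> s is
    added for every pair of listed regions with s a strict subset of r.
    """
    regions = [frozenset(r) for r in regions]
    parents = {r: [] for r in regions}
    children = {r: [] for r in regions}
    for r in regions:
        for s in regions:
            if s < r:  # strict subset test on frozensets
                children[r].append(s)
                parents[s].append(r)
    return parents, children

# A chain y1 - y2 - y3 with singleton and pairwise regions.
regions = [{1}, {2}, {3}, {1, 2}, {2, 3}]
parents, children = build_region_graph(regions)
```

In this example the region graph is bipartite: the pairwise regions have no parents and the singleton regions have no children.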



The Hammersley-Clifford theorem (e.g., Lauritzen (1996)) asserts that the Gibbs
distributions $p_{(x,y)}(\hat{y}; w)$ defined in equation (2) correspond to a Markov random field (MRF)
whose statistical independencies are described by the joint hypergraph. These independencies
are determined by the Markov property: two nodes in the graph are conditionally
independent when they are separated by observed nodes. Yedidia et al. (2005) show that
whenever the region graph is bipartite and has no cycles, the Markov property provides
a low dimensional representation of the Gibbs distribution using its marginal probabilities
$p_{(x,y)}(\hat{y}_r; w) = \sum_{\hat{y} \setminus \hat{y}_r} p_{(x,y)}(\hat{y}; w)$, namely

$$p_{(x,y)}(\hat{y}; w) = \prod_{r \in \mathcal{R}} p_{(x,y)}(\hat{y}_r; w)^{1 - |P(r)|}.$$

When the bipartite region graph has no cycles one can use the belief propagation algorithm
to efficiently compute the marginal probabilities $p_{(x,y)}(\hat{y}_r; w/\epsilon)$, for every $\epsilon > 0$, without
performing exponentially many operations:

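Algorithm 1 Belief Propagation (product form)

Set $\mathcal{K}_r = \{k : r \in \mathcal{R}_k\}$. For every $(x, y)$ set $\psi_r(\hat{y}_r) = \exp\left(\ell_r(y_r, \hat{y}_r) + \sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r)\right)$.
Repeat until convergence:

$$m_{\alpha \to i}(\hat{y}_i) = \Big( \sum_{\hat{y}_\alpha \setminus \hat{y}_i} \Big( \psi_\alpha(\hat{y}_\alpha) \prod_{j \in C(\alpha) \setminus i} n_{j \to \alpha}(\hat{y}_j) \Big)^{1/\epsilon} \Big)^{\epsilon}$$

$$n_{i \to \alpha}(\hat{y}_i) \propto \psi_i(\hat{y}_i) \prod_{\beta \in P(i) \setminus \alpha} m_{\beta \to i}(\hat{y}_i)$$

Output:

$$b_\alpha(\hat{y}_\alpha) \propto \Big( \psi_\alpha(\hat{y}_\alpha) \prod_{i \in C(\alpha)} n_{i \to \alpha}(\hat{y}_i) \Big)^{1/\epsilon}$$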
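To make the message updates concrete, here is a small sketch of belief propagation on a bipartite region graph with pairwise outer regions and singleton inner regions, together with a brute-force check. It assumes the temperature enters by raising the products to the power $1/\epsilon$ inside the sum and to the power $\epsilon$ outside, so that the beliefs are marginals of a distribution proportional to $(\prod_r \psi_r)^{1/\epsilon}$; the function names and potential tables are hypothetical.

```python
import itertools
import math

def belief_propagation(psi_inner, psi_outer, eps=1.0, iters=10):
    """psi_inner: {i: [psi_i(y_i)]}; psi_outer: {(i, j): table[y_i][y_j]}."""
    labels = {i: range(len(t)) for i, t in psi_inner.items()}
    parents = {i: [a for a in psi_outer if i in a] for i in psi_inner}
    n = {(i, a): [1.0] * len(labels[i]) for i in psi_inner for a in parents[i]}
    m = {(a, i): [1.0] * len(labels[i]) for a in psi_outer for i in a}
    for _ in range(iters):
        for a in psi_outer:                      # outer-to-inner messages
            for pos, i in enumerate(a):
                j = a[1 - pos]
                for yi in labels[i]:
                    total = 0.0
                    for yj in labels[j]:
                        ya = (yi, yj) if pos == 0 else (yj, yi)
                        total += (psi_outer[a][ya[0]][ya[1]]
                                  * n[(j, a)][yj]) ** (1.0 / eps)
                    m[(a, i)][yi] = total ** eps
        for i in psi_inner:                      # inner-to-outer messages
            for a in parents[i]:
                vals = [psi_inner[i][yi]
                        * math.prod(m[(b, i)][yi] for b in parents[i] if b != a)
                        for yi in labels[i]]
                s = sum(vals)
                n[(i, a)] = [v / s for v in vals]
    beliefs = {}
    for a in psi_outer:                          # beliefs of outer regions
        i, j = a
        table = [[(psi_outer[a][yi][yj] * n[(i, a)][yi] * n[(j, a)][yj])
                  ** (1.0 / eps) for yj in labels[j]] for yi in labels[i]]
        z = sum(map(sum, table))
        beliefs[a] = [[v / z for v in row] for row in table]
    return beliefs

def brute_force_pair_marginal(psi_inner, psi_outer, pair, eps=1.0):
    """Exact pairwise marginal by enumerating all joint labels."""
    ids = sorted(psi_inner)
    i, j = pair
    table = [[0.0] * len(psi_inner[j]) for _ in psi_inner[i]]
    for y in itertools.product(*[range(len(psi_inner[v])) for v in ids]):
        lab = dict(zip(ids, y))
        val = math.prod(psi_inner[v][lab[v]] for v in ids)
        val *= math.prod(t[lab[a]][lab[b]] for (a, b), t in psi_outer.items())
        table[lab[i]][lab[j]] += val ** (1.0 / eps)
    z = sum(map(sum, table))
    return [[v / z for v in row] for row in table]

# Chain y0 - y1 - y2 with binary labels (acyclic, so BP should be exact).
psi_inner = {0: [1.0, 2.0], 1: [1.5, 0.5], 2: [1.0, 3.0]}
psi_outer = {(0, 1): [[2.0, 1.0], [1.0, 2.0]], (1, 2): [[1.0, 3.0], [2.0, 1.0]]}
bp = belief_propagation(psi_inner, psi_outer, eps=0.5)
```

On this acyclic example the beliefs agree with the enumerated marginals; on graphs with cycles no such agreement should be expected.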
A bipartite region graph has two types of regions: outer regions,
i.e., regions that are not contained in other regions, and inner regions. To distinguish between
these regions we denote outer regions by $\alpha$ and inner regions by $i$. The marginal probabilities
$p_{(x,y)}(\hat{y}_r; w)$ appear in the beliefs $b_r(\hat{y}_r)$. In general, when the region graph has cycles the
belief propagation algorithm is not guaranteed to output the marginal probabilities. In
some cases the belief propagation algorithm outputs beliefs $b_r(\hat{y}_r)$ which approximate the
marginal probabilities well, while in other cases it produces inaccurate results or might fail
to converge. Recently, there has been an extensive effort to fix the drawbacks of the belief
propagation algorithm, and convergence of belief propagation type algorithms is attained
using techniques from convex duality, e.g., Heskes (2006); Hazan and Shashua (2010).



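Algorithm 2 Norm-Product Belief Propagation

For every $(x, y)$ set $\psi_r(\hat{y}_r) = \exp\left(\ell_r(y_r, \hat{y}_r) + \sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r)\right)$. Set $\hat{c}_i = c_i + \sum_{\alpha \in P(i)} c_\alpha$.
Repeat until convergence:

$$m_{\alpha \to i}(\hat{y}_i) = \Big( \sum_{\hat{y}_\alpha \setminus \hat{y}_i} \Big( \psi_\alpha(\hat{y}_\alpha) \prod_{j \in C(\alpha) \setminus i} n_{j \to \alpha}(\hat{y}_j) \Big)^{1/\epsilon c_\alpha} \Big)^{\epsilon c_\alpha}$$

$$n_{i \to \alpha}(\hat{y}_i) \propto \frac{\Big( \psi_i(\hat{y}_i) \prod_{\beta \in P(i)} m_{\beta \to i}(\hat{y}_i) \Big)^{c_\alpha / \hat{c}_i}}{m_{\alpha \to i}(\hat{y}_i)}$$

Output:

$$b_i(\hat{y}_i) \propto \Big( \psi_i(\hat{y}_i) \prod_{\alpha \in P(i)} m_{\alpha \to i}(\hat{y}_i) \Big)^{1/\epsilon \hat{c}_i}$$

$$b_\alpha(\hat{y}_\alpha) \propto \Big( \psi_\alpha(\hat{y}_\alpha) \prod_{i \in C(\alpha)} n_{i \to \alpha}(\hat{y}_i) \Big)^{1/\epsilon c_\alpha}$$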



The norm-product algorithm reduces to belief propagation when setting its coefficients
to $c_r = 1 - |P(r)|$. These coefficients also appear when constructing a probability
distribution from its marginal probabilities in a graph without cycles, and are known as the Bethe
coefficients. We refer the interested reader to Wainwright and Jordan (2008); Koller and
Friedman (2009) for more details.



The norm-product is guaranteed to converge whenever $\epsilon, c_r > 0$ and typically its beliefs
approximate the marginal probabilities as well as the belief propagation approximations.
Thus in its various forms it can be used to approximately learn the parameters of structured
predictors as well as its gradient, described in equation (3). However, iteratively executing
the norm-product algorithm as a sub-procedure to compute the gradient is computationally
intractable and this method has not been widely used. In the following we explore the primal
and dual aspects of learning structured predictors, showing that both CRFs and structured
SVMs aim at matching the empirical moments $\sum_{(x,y) \in S} \phi(x, y)$ using different selection
rules. This provides the means to efficiently learn the parameters of a graphical
model, based on dual decomposition, which targets moment matching instead of the
time-consuming estimation of the probability $p_{(x,y)}(\hat{y}; w/\epsilon)$.
3. Convex Duality: Loss Minimization and Moment Matching



CRFs and structured SVMs, as well as their one-parameter extension, are convex programs,
thus they have a dual program. Duality theory has turned out to be very effective in machine
learning as it provides a principled way to decompose the different ingredients of the primal
objective through its Lagrange multipliers. The dual decomposition in turn provides the
means to efficiently estimate the different ingredients while maintaining their consistency
using the dual objective.

When dealing with convex programs one usually needs to consider the set of primal
feasible solutions while constructing the dual function. We find it simpler to describe the
primal program using extended real-valued convex functions, which are functions that can
take the value of infinity. Intuitively, by using extended real-valued functions we can ignore
their domains, since points outside the domain are assigned the value of infinity, thus simplifying
the derivations. The dual programs of extended real-valued convex functions $g(\mu)$ are
conveniently formulated in terms of their conjugate dual

$$g^*(z) = \max_{\mu} \left\{ \mu^\top z - g(\mu) \right\}.$$
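A quick numerical illustration of the conjugate dual may help. For the scalar function $g(\mu) = \frac{C}{2}\mu^2$ the maximization can be solved in closed form: the maximizer is $\mu = z/C$ and $g^*(z) = \frac{1}{2C}z^2$. The sketch below compares this with a brute-force maximization over a grid; the grid bounds and the helper name `conjugate_on_grid` are assumptions made for illustration.

```python
def conjugate_on_grid(g, z, lo=-10.0, hi=10.0, steps=20001):
    """Approximate g*(z) = max_mu { mu * z - g(mu) } on a uniform grid."""
    best = float("-inf")
    for i in range(steps):
        mu = lo + (hi - lo) * i / (steps - 1)
        best = max(best, mu * z - g(mu))
    return best

C = 2.0
z = 3.0
numeric = conjugate_on_grid(lambda mu: 0.5 * C * mu * mu, z)
closed_form = z * z / (2.0 * C)   # conjugate of (C/2) mu^2 evaluated at z
```

The grid value can only underestimate the true maximum, and for this quadratic the gap is bounded by $\frac{C}{2}$ times the squared grid spacing.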

Throughout this work we use the following duality theorem, known as Fenchel duality
(cf. Fenchel (1951); Rockafellar (1970); Bertsekas et al. (2003)):

Theorem 1 Let $f_s : \mathbb{R}^Y \to \mathbb{R}$ and $h_t : \mathbb{R} \to \mathbb{R}$ be extended real-valued and convex functions,
and let $a_{s,t}, g_s$ be vectors of length $Y$, for every $s$. The following are primal and dual
programs:

$$\text{(Primal)} \quad \min_{u} \sum_s f_s\Big( \sum_t u_t a_{s,t} + g_s \Big) - d^\top u + \sum_t h_t(-u_t)$$

$$\text{(Dual)} \quad \max_{p} \sum_s \Big( -f_s^*(p_s) + p_s^\top g_s \Big) - \sum_t h_t^*\Big( \sum_s a_{s,t}^\top p_s - d_t \Big)$$

Strong duality holds if the functions satisfy $f_s(\mu_s), h_t(u_t) > -\infty$, their domains are defined
with linear equalities and inequalities, they are continuous on their domains, their domains
intersect and the primal optimal value is finite.

Proof: We use the Lagrange duality theorem, minimizing the function $\sum_s f_s(\mu_s + g_s) - d^\top u +
\sum_t h_t(-u_t)$ subject to the constraints $\mu_s(y) = \sum_t u_t a_{s,t}(y)$. These equality constraints hold
for every $y = 1, ..., Y$, and therefore correspond to Lagrange multipliers $p_s \in \mathbb{R}^Y$, for every $s$.
The Lagrangian takes the form

$$L(\mu, u, p) = \sum_s f_s(\mu_s + g_s) - d^\top u + \sum_t h_t(-u_t) + \sum_s p_s^\top \Big( \sum_t u_t a_{s,t} - \mu_s \Big).$$

By minimizing with respect to the primal variables, $\min_{\mu, u} L(\mu, u, p)$, we get the dual function
above. Strong duality holds by Theorems 6.2.5, 6.4.1, 6.4.2 in Bertsekas et al. (2003). □

The above duality theorem describes the relations between two types of functions through
their conjugate dual functions. The parameter learning problem in equation (1) consists
of two such functions, the loss function and the regularization. The extended log-loss is
dominated by the normalizing constant of its Gibbs distribution, thus its conjugate dual is
the entropy barrier function. The dual variables are then probability distributions $p_{(x,y)}(\hat{y})$
and the dual program maximizes their entropy. The regularization consists of the square
function, which is its own conjugate dual, therefore the learning dual tries to match the
empirical moments $\sum_{(x,y)} \phi_k(x, y)$ using these probabilities. Hence the dual program for
learning with extended log-loss balances between maximizing the entropy barrier function
and fitting the moment matching constraints.

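Corollary 2 Let $\Delta_{\mathcal{Y}}$ be the probability simplex, i.e., the set of probability distributions over
$\mathcal{Y}$. Define the entropy as a barrier function over the probability simplex,

$$\epsilon H(p) = \begin{cases} -\epsilon \sum_{\hat{y}} p(\hat{y}) \log p(\hat{y}) & \text{if } p \in \Delta_{\mathcal{Y}} \\ -\infty & \text{otherwise} \end{cases}$$

Then the following are primal and dual programs.

$$\text{(Primal)} \quad \min_{w} \sum_{(x,y) \in S} \bar{\ell}_{\epsilon\text{-}log}(w, x, y) + \frac{C}{2} \|w\|_2^2$$

$$\text{(Dual)} \quad \max_{p_{(x,y)}} \sum_{(x,y) \in S} \Big( \epsilon H(p_{(x,y)}) + \sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}) \ell(y, \hat{y}) \Big) - \frac{1}{2C} \Big\| \sum_{(x,y) \in S} \sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}) \phi(x, \hat{y}) - \sum_{(x,y) \in S} \phi(x, y) \Big\|_2^2$$

In particular, $\bar{\ell}_{\epsilon\text{-}log}(w, x, y)$ and $\|w\|_2^2$ satisfy the conditions in Theorem 1, therefore strong
duality holds.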



Proof: Set $Z_\epsilon(w, x, y)$ to be the normalizing constant of $p_{(x,y)}(\hat{y}; w/\epsilon)$. Thus the
extended log-loss equals $\epsilon \log Z_\epsilon(w, x, y) - \sum_k w_k \phi_k(x, y)$. The proof then follows from
Theorem 1 when setting $s = (x, y)$ and $u = w$, where the index $t$ is equivalent to
the feature index $k$. Therefore $a_{s,k}(\hat{y}) = \phi_k(x, \hat{y})$, $f_s(\sum_k w_k a_{s,k} + g_s) = \epsilon \log Z_\epsilon(w, x, y)$,
$g_s(\hat{y}) = \ell(y, \hat{y})$, $d_k = \sum_{(x,y)} \phi_k(x, y)$ and $h_k(-w_k) = \frac{C}{2} w_k^2$, while noticing that the conjugate
dual of $\epsilon \log Z_\epsilon(w, x, y)$ is $\epsilon H(p_{(x,y)})$ (e.g., Wainwright and Jordan (2008), Theorem 8.1) and
the conjugate dual of $\frac{C}{2} w^2$ is $\frac{1}{2C} z^2$ (e.g., Rockafellar (1970), page 106). □
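The primal-dual relation in Corollary 2 can be checked numerically on a toy problem by brute-force enumeration. The sketch below minimizes the primal with plain gradient descent and evaluates the dual at the Gibbs distributions induced by the resulting weights. It assumes the Gibbs distribution is proportional to $\exp((\ell + w^\top\phi)/\epsilon)$ and that the loss is zero at the true label; the data are random and all names are hypothetical.

```python
import math
import random

def scores(w, feats, losses):
    return [l + sum(a * b for a, b in zip(w, f)) for f, l in zip(feats, losses)]

def softmax(theta, eps):
    m = max(theta)
    e = [math.exp((t - m) / eps) for t in theta]
    z = sum(e)
    return [v / z for v in e]

def primal(w, data, eps, C):
    total = 0.5 * C * sum(v * v for v in w)
    for feats, losses, y in data:
        theta = scores(w, feats, losses)
        m = max(theta)
        total += m + eps * math.log(sum(math.exp((t - m) / eps) for t in theta))
        total -= theta[y]
    return total

def gradient(w, data, eps, C):
    g = [C * v for v in w]
    for feats, losses, y in data:
        p = softmax(scores(w, feats, losses), eps)
        for k in range(len(w)):
            g[k] += sum(q * f[k] for q, f in zip(p, feats)) - feats[y][k]
    return g

def dual(ps, data, eps, C):
    n_feat = len(data[0][0][0])
    moment = [0.0] * n_feat     # pseudo moments minus empirical moments
    total = 0.0
    for p, (feats, losses, y) in zip(ps, data):
        total += -eps * sum(q * math.log(q) for q in p if q > 0)  # eps * H(p)
        total += sum(q * l for q, l in zip(p, losses))
        for k in range(n_feat):
            moment[k] += sum(q * f[k] for q, f in zip(p, feats)) - feats[y][k]
    return total - sum(v * v for v in moment) / (2.0 * C)

random.seed(0)
eps, C = 0.5, 1.0
data = []
for _ in range(3):              # 3 examples, 4 labels, 2 features
    feats = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
    data.append((feats, [0.0, 1.0, 1.0, 1.0], 0))

w = [0.0, 0.0]
for _ in range(3000):           # fixed small step, chosen conservatively
    g = gradient(w, data, eps, C)
    w = [wi - 0.02 * gi for wi, gi in zip(w, g)]

p_star = [softmax(scores(w, f, l), eps) for f, l, _ in data]
gap = primal(w, data, eps, C) - dual(p_star, data, eps, C)
```

For any weights the gap between the primal value and the dual value at the induced Gibbs distributions equals the squared gradient norm divided by $2C$, so it shrinks to zero as gradient descent converges.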



Both the primal and dual programs are well defined for $C = 0$, where the primal
regularization does not exist and the dual program enforces the moment matching as hard
constraints. Generally, the parameter $C$ balances between the extended log-loss and the
regularization. From the dual perspective, the parameter $C$ balances between the entropy
barrier function $\epsilon H(p)$ and the moment matching constraints. One can observe that
the parameter $\epsilon$ also balances between the entropy barrier and the moment matching. When
considering learning with extended log-loss, restricted to $\ell(y, \hat{y}) = 0$, the parameter $\epsilon$ affects
the solution in the same way as $C$, since the dual program can be equivalently written as

$$\epsilon \cdot \max_{p_{(x,y)}} \Big\{ \sum_{(x,y) \in S} H(p_{(x,y)}) - \frac{1}{2C\epsilon} \Big\| \sum_{(x,y) \in S} \Big( \sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}) \phi(x, \hat{y}) - \phi(x, y) \Big) \Big\|_2^2 \Big\}.$$
Restricting the dual program in Corollary 2 to $\epsilon = 1$, it describes the well-known duality
relation between the log-likelihood and the entropy, which is used in the context of CRFs
by Lebanon and Lafferty (2002). When $\epsilon = 0$ we obtain the known dual formulation of
structured SVMs, which emphasizes the duality between the max-function and the
probability simplex (Taskar et al. (2004); Tsochantaridis et al. (2006); Collins et al. (2008)).
Thus, the seemingly different frameworks of CRFs and structured SVMs share the same
moment matching perspective, and only differ by the selection rule for their probability
distributions. Since these two formulations were proven to be successful in many cases of
interest, we conclude that moment matching is important for learning the parameters of
structured predictors. Nevertheless, when considering structured labels $y = (y_1, ..., y_n)$
the computational complexity of moment matching is exponential in $n$, for general
graphical models. In the following we describe how to achieve moment matching using pseudo
marginal probabilities and their corresponding low dimensional extended log-loss functions.



4. Pseudo Moment Matching and its Loss Formulation 

Considering structured labels $y = (y_1, ..., y_n)$ over region graphs, the complexities of the
primal and dual learning programs are exponential in $n$, as one needs to evaluate the extended log-loss
or the entropy barrier function, respectively. However, the moment matching constraints,
appearing in the dual program, are low dimensional and depend on the size of the regions.
Thinking about the entropy barrier function as a selection rule that is independent of the
moment matching constraints, we can reduce the complexity of the dual program. We
match the moments with pseudo marginal probabilities, i.e., beliefs, while applying a low
dimensional selection rule replacing the entropy barrier function.

Restricting ourselves to graph based features, defined in equation (4), the moment
matching constraints in the dual program of Corollary 2 are taken with respect to the
marginal probabilities, namely

$$\sum_{\hat{y} \in \mathcal{Y}} p_{(x,y)}(\hat{y}) \phi_k(x, \hat{y}) = \sum_{r \in \mathcal{R}_k} \sum_{\hat{y}_r \in \mathcal{Y}_r} p_{(x,y)}(\hat{y}_r) \phi_{k,r}(x, \hat{y}_r).$$

Thus the effective complexity of the moment matching constraints is the number of labels
in the regions, namely $\sum_r |\mathcal{Y}_r|$. These averages can also be computed with beliefs $b_{(x,y),r}(\hat{y}_r)$,
i.e., probability distributions over region labels that do not necessarily come from a consistent
distribution over all labels. To enforce local consistency between these averages we require
these beliefs to agree on their overlapping labels. The selection rule we propose is the
entropy barrier function for these beliefs. Thus, the pseudo (dual) moment matching and
its (primal) low dimensional formulations for the extended log-loss take the following form:



Theorem 3 Consider region based features, defined in equation (4), and their corresponding
region graphs. Assume the loss function decomposes with respect to its regions $\mathcal{R}_\ell$. Let $\mathcal{K}_r$
be the set $\{k : r \in \mathcal{R}_k\}$ and denote by $P(r)$ and $C(r)$ the parents and children of a
region in the joint region graph. Consider, for every $(x, y) \in S$, $r \in \mathcal{R}$, $p \in P(r)$, the real
valued vector $\lambda_{(x,y),r \to p}(\hat{y}_r)$ and define the parameterized beliefs

$$b_{(x,y),r}(\hat{y}_r; w, \lambda) \propto \exp\Big( \ell_r(y_r, \hat{y}_r) + \sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) - \sum_{p \in P(r)} \lambda_{(x,y),r \to p}(\hat{y}_r) \Big).$$
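Define the low dimensional extended log-loss as $\bar{\ell}_{r,\epsilon\text{-}log}(w, x, y_r) = -\epsilon \log b_{(x,y),r}(y_r; w/\epsilon, \lambda/\epsilon)$.
Then the following are primal and dual programs.

$$\text{(Primal)} \quad \min_{w, \lambda} \sum_{(x,y) \in S, r \in \mathcal{R}} \bar{\ell}_{r,\epsilon\text{-}log}(w, x, y_r) + \frac{C}{2} \|w\|_2^2$$

$$\text{(Dual)} \quad \max_{b_{(x,y),r}} \sum_{(x,y) \in S, r \in \mathcal{R}} \Big( \epsilon H(b_{(x,y),r}) + \sum_{\hat{y}_r} b_{(x,y),r}(\hat{y}_r) \ell_r(y_r, \hat{y}_r) \Big) - \frac{1}{2C} \sum_k \Big( \sum_{(x,y) \in S} \sum_{r \in \mathcal{R}_k} \sum_{\hat{y}_r \in \mathcal{Y}_r} b_{(x,y),r}(\hat{y}_r) \phi_{k,r}(x, \hat{y}_r) - \sum_{(x,y) \in S} \phi_k(x, y) \Big)^2$$

$$\text{subject to} \quad \forall (x,y), r, \hat{y}_r, p \in P(r): \quad b_{(x,y),r}(\hat{y}_r) = \sum_{\hat{y}_p \setminus \hat{y}_r} b_{(x,y),p}(\hat{y}_p)$$

Strong duality holds since $\bar{\ell}_{r,\epsilon\text{-}log}(w, x, y_r)$ and $\|w\|_2^2$ satisfy the conditions in Theorem 1.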

Proof: The proof follows from Theorem 1, while we derive the primal program as the dual
of the dual program. However, the indexing is a bit more involved. The index $s$ relates to
the triplets of indexes $(x, y), r$, thus $g_s(\hat{y}_r) = \ell_r(y_r, \hat{y}_r)$. The index $t$ either corresponds to a
moment matching constraint index $k$ or to a marginalization constraint index $(x, y), r, \hat{y}_r, p$.
For $t = k$ we use $a_{s,t}(\hat{y}_r) = \phi_{k,r}(x, \hat{y}_r)$, and for $t = (x, y), r, \hat{y}_r, p$ we enforce marginalization
constraints by setting $a_{s,t}(\hat{y}_p) = 1$ if $s, t$ agree on the parent index in $t$ and $\hat{y}_p$ contains $\hat{y}_r$, and
$a_{s,t}(\hat{y}_r) = -1$ if $s, t$ agree on the child index in $t$. When $t = k$ we set $d_k = \sum_{(x,y) \in S} \phi_k(x, y)$,
and zero otherwise. When $t = k$ we set $h_t^*(z) = \frac{1}{2C} z^2$, whose conjugate dual is $h_t(w) = \frac{C}{2} w^2$.
For $t = (x, y), r, \hat{y}_r, p$ we set $h_t^*(0) = 0$ and $h_t^*(z) = \infty$ otherwise, whose conjugate dual is
$h_t(\lambda) = 0$. Since $f_s^*()$ is the entropy barrier function, its conjugate dual is the normalizing
constant of $b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)$. We thus arrive at the final primal form, by adding to the
linear term $d^\top u$ the quantity

$$\sum_{(x,y) \in S} \sum_{r \in \mathcal{R}} \Big( \sum_{p \in P(r)} \lambda_{(x,y),r \to p}(y_r) - \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(y_c) \Big) = 0,$$

which creates the numerator of the parametrized beliefs in the extended log-loss by
multiplying with $\epsilon/\epsilon$ and exponentiating while taking the logarithm. □

Comparing the extended log-loss formulation in Corollary 2 to the low dimensional
formulation in Theorem 3, we conclude that the difference between these two programs is in
their probability models. In the low dimensional formulation we fit parameters $w$ to beliefs
$b_{(x,y),r}(\hat{y}_r; w, \lambda)$ whose local consistencies are governed by the variables $\lambda$. Using strong
duality we are able to guarantee that the optimal beliefs are consistent with each other,
since $\lambda$ are Lagrange multipliers of the marginalization constraints in the dual program.

The connections of the primal variables $\lambda$ to the dual marginalization constraints
suggest that the primal formulation in Theorem 3 is the objective function for the approximate
heuristics described in Section 2.2, which recover parametrized beliefs that agree on their
marginal probabilities. These heuristics require running the norm-product belief
propagation to convergence in order to obtain beliefs that agree on their marginal probabilities
for updating the weights, thus they are computationally intractable. Our practical goal
in this work is to overcome this computational difficulty and efficiently optimize this
objective in large graphical models. For this purpose we show how to update the parameters $w$
using beliefs that do not necessarily agree on their marginal probabilities throughout the
algorithm run-time, but only when it converges.

Strong duality holds for the pseudo moment matching and its corresponding low
dimensional extended log-loss formulation. Therefore, one can either minimize the primal or
maximize the dual to get the same results. Nevertheless, there are computational differences
between these programs. The dual program is constrained and requires (sub)gradient
descent methods that consider all variables. In contrast, the primal program is unconstrained,
and one can perform block coordinate descent on its variables. Coordinate descent methods
are appealing as they optimize a small number of variables while holding the rest fixed;
therefore they can be performed efficiently and can be easily parallelized. Moreover, coordinate
descent for the primal program in Theorem 3 can be performed by sending messages over its
region graph, and thus can be efficiently applied to learn parameters of large graphical models.

Performing block coordinate descent on the primal objective in Theorem 3 requires
minimizing a block of variables while holding the rest fixed. We begin by describing how to
find the optimal set of variables $\lambda_{(x,y),r \to p}(\hat{y}_r)$ that are related to a region and its parents
in the graphical model.

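Lemma 4 For every $(x, y)$ set $\phi_r(\hat{y}_r) = \ell_r(y_r, \hat{y}_r) + \sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r)$, where $\mathcal{K}_r = \{k \mid r \in
\mathcal{R}_k\}$. For every region $r$, the optimal $\lambda_{(x,y),r \to p}(\hat{y}_r)$ for every $p \in P(r)$, $\hat{y}_r \in \mathcal{Y}_r$, $(x, y) \in S$
in the primal program of Theorem 3 satisfy

$$\mu_{(x,y),p \to r}(\hat{y}_r) = \epsilon \log \sum_{\hat{y}_p \setminus \hat{y}_r} \exp\Big( \Big( \phi_p(\hat{y}_p) + \sum_{c \in C(p) \setminus r} \lambda_{(x,y),c \to p}(\hat{y}_c) - \sum_{p' \in P(p)} \lambda_{(x,y),p \to p'}(\hat{y}_p) \Big) / \epsilon \Big)$$

$$\lambda_{(x,y),r \to p}(\hat{y}_r) = \frac{1}{1 + |P(r)|} \Big( \phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) + \sum_{p' \in P(r)} \mu_{(x,y),p' \to r}(\hat{y}_r) \Big) - \mu_{(x,y),p \to r}(\hat{y}_r)$$

Moreover, since the program is not strictly convex, the optimal solutions are invariant to
additive shifts, namely $\lambda_{(x,y),r \to p}(\hat{y}_r) + c$ are also optimal solutions for every constant
$c$.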
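The block update can be illustrated on the smallest non-trivial region graph: one pairwise parent region over two binary variables and its two singleton child regions. The sketch below assumes the update has the form $\lambda_{r \to p} = \frac{1}{1+|P(r)|}(\phi_r + \sum_c \lambda_{c \to r} + \sum_{p'} \mu_{p' \to r}) - \mu_{p \to r}$, with $\mu_{p \to r}$ the soft-max marginalization of the parent potential; the potentials and all function names are hypothetical.

```python
import math

def mu_parent_to_child(phi_p, lam, child, eps):
    """mu_{p->r}(y_r): eps * log-sum-exp over the other child's label."""
    other = 1 - child
    out = []
    for yr in range(2):
        terms = []
        for yo in range(2):
            yp = (yr, yo) if child == 0 else (yo, yr)
            terms.append((phi_p[yp[0]][yp[1]] + lam[other][yo]) / eps)
        m = max(terms)
        out.append(eps * (m + math.log(sum(math.exp(t - m) for t in terms))))
    return out

def update_lambda(phi_r, phi_p, lam, child, eps):
    """Closed-form block update of lambda_{r->p} for one child region."""
    mu = mu_parent_to_child(phi_p, lam, child, eps)
    n_parents = 1  # each singleton has exactly one parent in this example
    lam[child] = [(phi_r[child][y] + mu[y]) / (1 + n_parents) - mu[y]
                  for y in range(2)]

def beliefs(phi_r, phi_p, lam, eps):
    """Parameterized beliefs of the two children and of the parent."""
    b_child = []
    for c in range(2):
        raw = [math.exp((phi_r[c][y] - lam[c][y]) / eps) for y in range(2)]
        z = sum(raw)
        b_child.append([v / z for v in raw])
    raw = [[math.exp((phi_p[a][b] + lam[0][a] + lam[1][b]) / eps)
            for b in range(2)] for a in range(2)]
    z = sum(map(sum, raw))
    b_parent = [[v / z for v in row] for row in raw]
    return b_child, b_parent

phi_r = [[0.2, -0.1], [0.0, 0.5]]    # toy singleton potentials
phi_p = [[1.0, -1.0], [-0.5, 0.7]]   # toy pairwise potential
lam = [[0.0, 0.0], [0.0, 0.0]]
eps = 1.0
update_lambda(phi_r, phi_p, lam, 0, eps)
b_child, b_parent = beliefs(phi_r, phi_p, lam, eps)
```

Right after a block is updated, the parent belief marginalized onto that child matches the child belief; updating another block can break this agreement, which is why the updates are repeated until convergence.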

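Proof: The primal program in Theorem 3 is convex and unconstrained, therefore the
optimum is achieved when the (sub)gradient vanishes. For $\epsilon = 0$ define $b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)$
to be a probability distribution over the maximal elements $\mathcal{Y}_r^* = \arg\max_{\hat{y}_r \in \mathcal{Y}_r} \{ \ell_r(y_r, \hat{y}_r) +
\sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r) \}$. Then the gradient of the primal objective with respect to $\lambda_{(x,y),r \to p}(\hat{y}_r)$ takes the form

$$\frac{\partial}{\partial \lambda_{(x,y),r \to p}(\hat{y}_r)} = \sum_{\hat{y}_p \setminus \hat{y}_r} b_{(x,y),p}(\hat{y}_p; w/\epsilon, \lambda/\epsilon) - b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon).$$

The optimal variables $\lambda$ are those for which the gradient vanishes, i.e., the corresponding
beliefs agree on their marginal probabilities. When setting $\mu_{(x,y),p \to r}(\hat{y}_r)$ as above, the
marginalization of $b_{(x,y),p}(\hat{y}_p; w, \lambda)$ satisfies

$$\sum_{\hat{y}_p \setminus \hat{y}_r} b_{(x,y),p}(\hat{y}_p; w/\epsilon, \lambda/\epsilon) \propto \exp\Big( \big( \mu_{(x,y),p \to r}(\hat{y}_r) + \lambda_{(x,y),r \to p}(\hat{y}_r) \big) / \epsilon \Big).$$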
yp\yr 
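Therefore, by taking the logarithm, the gradient vanishes whenever the beliefs' numerators
agree up to an additive constant

$$\mu_{(x,y),p \to r}(\hat{y}_r) + \lambda_{(x,y),r \to p}(\hat{y}_r) = \phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) - \sum_{p' \in P(r)} \lambda_{(x,y),r \to p'}(\hat{y}_r).$$

Summing both sides with respect to $p \in P(r)$ we are able to obtain

$$\sum_{p' \in P(r)} \lambda_{(x,y),r \to p'}(\hat{y}_r) = \frac{|P(r)|}{1 + |P(r)|} \Big( \phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) \Big) - \frac{1}{1 + |P(r)|} \sum_{p' \in P(r)} \mu_{(x,y),p' \to r}(\hat{y}_r).$$

Plugging it into the above equation results in the desired block coordinate descent update rule, i.e.,
$\lambda_{(x,y),r \to p}(\hat{y}_r)$ for which the partial derivatives vanish. □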

The above lemma describes an analytic solution for the optimal $\lambda_{(x,y),r \to p}(\hat{y}_r)$, which are computed in the block coordinate steps of the algorithm. In practice, block coordinate descent with analytic steps provides a significant speedup over conventional gradient methods and can be parallelized and distributed easily, as shown by Schwing et al. (2011). Unfortunately, we are not able to analytically find the optimal $w_k$ while holding the rest fixed, thus we perform a step in the direction of the negative (sub)gradient.
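The message $\mu$ in Lemma 4 is a temperature-scaled log-sum-exp over the parent labels that are consistent with a child label, and it becomes a plain maximum at $\epsilon = 0$. The following is a minimal sketch of that single operation, not the authors' implementation. The names `soft_max` and `mu_message`, the list-of-rows layout, and the toy numbers are assumptions made for illustration.

```python
import math

def soft_max(values, eps):
    """eps * log(sum(exp(v / eps))); reduces to max(values) when eps == 0."""
    top = max(values)
    if eps == 0:
        return top
    # Shift by the maximum so the exponentials cannot overflow.
    return top + eps * math.log(sum(math.exp((v - top) / eps) for v in values))

def mu_message(parent_scores, eps):
    """parent_scores[y_r] holds, for one child label y_r, the parent score
    (phi_p plus messages from other children minus messages to parents of p)
    of every parent label y_p consistent with y_r. Hypothetical layout."""
    return [soft_max(row, eps) for row in parent_scores]

scores = [[1.0, 3.0], [0.5, -2.0]]  # toy numbers: two child labels, two parent labels each
for eps in (1.0, 0.1, 0.0):
    print(eps, mu_message(scores, eps))
```

As $\epsilon$ shrinks, each entry approaches the largest parent score in its row, which is the structured SVM end of the extended log-loss.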

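Lemma 5 The (sub)gradient of the approximated structured prediction program in Theorem 3 with respect to $w_k$ equals

$$\sum_{(x,y) \in \mathcal{S}} \sum_{r \in \mathcal{R}_k} \Big(\sum_{\hat{y}_r \in \mathcal{Y}_r} b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)\, \phi_{k,r}(x, \hat{y}_r) - \phi_{k,r}(x, y_r)\Big) + C w_k.$$

Proof: Recall the definition of $b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)$, $\epsilon > 0$, in Theorem 3. For $\epsilon = 0$ we use its definition in Lemma 4 and Danskin's theorem. □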

The computational complexity of the gradient depends on the structure of the features, namely the number of regions and their labels. Therefore our framework prefers features with small regions and a reasonable number of labels. Another computational issue relates to the step size $\eta$ for optimizing $w_k$. In general, the coordinate descent scheme verifies that the chosen step size $\eta$ reduces the objective. Theoretically, we can use the fact that the gradient is Lipschitz continuous to predetermine a step size that guarantees descent. However, in practice this gives worse performance than searching for a step size by dividing $\eta$ by a constant factor until descent is guaranteed.
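The step-size search described above can be sketched as follows. This is an illustrative stand-in rather than the paper's code: the quadratic `objective`, the constants `C` and `d`, the shrink factor of 0.5, and the function name `backtracking_step` are all assumptions.

```python
def backtracking_step(objective, w, grad, eta=1.0, shrink=0.5, max_tries=50):
    """Shrink eta by a constant factor until the gradient step lowers the objective."""
    base = objective(w)
    for _ in range(max_tries):
        candidate = [wi - eta * gi for wi, gi in zip(w, grad)]
        if objective(candidate) < base:
            return candidate, eta
        eta *= shrink
    return list(w), 0.0  # no descent found: keep the current point

# Toy smooth objective 0.5 * C * ||w||^2 - d . w (hypothetical stand-in for the primal).
C, d = 3.0, [1.0, -2.0]

def objective(w):
    return 0.5 * C * sum(x * x for x in w) - sum(di * x for di, x in zip(d, w))

def gradient(w):
    return [C * x - di for x, di in zip(w, d)]

w = [0.0, 0.0]
for _ in range(30):
    w, eta = backtracking_step(objective, w, gradient(w))
print(w)  # approaches d / C
```

Restarting from the same initial `eta` at every iteration keeps the sketch simple; reusing the last accepted step size is an equally reasonable design choice.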

Lemmas 4 and 5 describe our algorithm for learning the parameters that minimize the low dimensional extended log-loss formulation. Theoretically, the order of the minimization steps is not important: as long as all parameters are optimized, the minimal value is attained. For example, one can minimize the $\lambda$ variables until they do not change before optimizing $w$. Since the $\lambda$ messages enforce marginalization constraints, this amounts to finding parametrized beliefs $b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)$ that agree on their marginal probabilities. This approach is equivalent to performing the approximate inference heuristic described in Section 2.2, thus our low dimensional structured prediction in Theorem 3 provides the objective function for this heuristic. However, this heuristic is computationally intractable as it requires optimizing $\lambda$ until convergence for every descent step of $w$. Since our algorithm does not depend on the order of the minimization steps, it also provides a principled method to optimize $w$ using beliefs that do not agree on their marginal probabilities. For this purpose our algorithm computes the parametrized beliefs differently than the (outer) beliefs that are computed by the approximate inference heuristics. This property is important in practice, since at the beginning of the algorithm's runtime, when the given $w$ are far from the optimum, one need not spend time on computing consistent beliefs. Figure 1 summarizes the algorithm in its product form, setting $n = \exp(\lambda)$ and $m = \exp(\mu)$.

Message-passing algorithm for low dimensional structured prediction

Consider the primal program in Theorem 3. For any initial values of the messages $n_{(x,y),r \to p}(\hat{y}_r)$:

1. Repeat until convergence:

2. For every $(x, y) \in \mathcal{S}$, $r \in \mathcal{R}$, $\hat{y}_r \in \mathcal{Y}_r$, $p \in P(r)$:

   Set $\psi_r(\hat{y}_r) = \exp\big(\ell_r(y_r, \hat{y}_r) + \sum_{k \in \mathcal{K}_r} w_k \phi_{k,r}(x, \hat{y}_r)\big)$.

   $$m_{(x,y),p \to r}(\hat{y}_r) = \Big(\sum_{\hat{y}_p \setminus \hat{y}_r} \Big(\psi_p(\hat{y}_p) \frac{\prod_{c \in C(p) \setminus r} n_{(x,y),c \to p}(\hat{y}_c)}{\prod_{p' \in P(p)} n_{(x,y),p \to p'}(\hat{y}_p)}\Big)^{1/\epsilon}\Big)^{\epsilon}$$

   $$n_{(x,y),r \to p}(\hat{y}_r) = \Big(\psi_r(\hat{y}_r) \prod_{c \in C(r)} n_{(x,y),c \to r}(\hat{y}_c) \prod_{p' \in P(r)} m_{(x,y),p' \to r}(\hat{y}_r)\Big)^{1/(1 + |P(r)|)} \Big/\ m_{(x,y),p \to r}(\hat{y}_r)$$

3. Set $b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon) \propto \Big(\psi_r(\hat{y}_r) \prod_{c \in C(r)} n_{(x,y),c \to r}(\hat{y}_c) \prod_{p \in P(r)} n^{-1}_{(x,y),r \to p}(\hat{y}_r)\Big)^{1/\epsilon}$.

4. For every $k$: $w_k \leftarrow w_k - \eta \Big(\sum_{(x,y) \in \mathcal{S}} \sum_{r \in \mathcal{R}_k} \big(\sum_{\hat{y}_r \in \mathcal{Y}_r} b_{(x,y),r}(\hat{y}_r; w/\epsilon, \lambda/\epsilon)\, \phi_{k,r}(x, \hat{y}_r) - \phi_{k,r}(x, y_r)\big) + C w_k\Big)$.

Figure 1: The product form of the block coordinate descent algorithm for the primal program in Theorem 3, as described in Lemmas 4 and 5. The step size $\eta$ is set to guarantee convergence (e.g., corresponding to the Lipschitz constant or the Armijo rule).

The block coordinate descent algorithm is guaranteed to converge, as it monotonically decreases the primal program in Theorem 3, which is lower bounded by its dual. However, convergence to the global minimum cannot be guaranteed in all cases. In particular, for $\epsilon = 0$ coordinate descent on the low dimensional structured SVM program is not guaranteed to converge to its global minimum. To converge to the global minimum in this case one can use subgradient methods, but despite their theoretical guarantees they turn out to be slow in practice. Since the primal program is not strictly convex in $\lambda$, even when we are guaranteed to converge to the global minimum, i.e., when $\epsilon > 0$, the sequence of variables $\lambda_{(x,y),r \to p}(\hat{y}_r)$ generated by the algorithm is not guaranteed to be bounded. As a trivial example, adding an arbitrary constant to the variables, $\lambda_{(x,y),r \to p}(\hat{y}_r) + c$, does not change the objective value, hence the algorithm can generate monotonically decreasing unbounded sequences. However, the beliefs generated by the algorithm are bounded and guaranteed to converge to the unique solution of the dual program. The convergence properties of the algorithm are summarized in the following claim.
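The additive-shift remark can be checked numerically: shifting one outgoing message by a constant rescales every unnormalized belief by the same factor, so the normalized beliefs stay put even when the messages drift. The sketch below assumes, for simplicity, that every message is already indexed by the labels of the region itself; `belief` and the toy numbers are hypothetical.

```python
import math

def belief(phi, from_children, to_parents, eps):
    """b(y) proportional to exp((phi[y] + sum_c lam_c[y] - sum_p lam_p[y]) / eps), eps > 0."""
    scores = [
        (phi[y] + sum(m[y] for m in from_children) - sum(m[y] for m in to_parents)) / eps
        for y in range(len(phi))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [wgt / total for wgt in weights]

phi = [0.2, -1.0, 0.7]                              # toy region scores
to_parents = [[0.1, 0.4, -0.3], [0.0, 0.2, 0.5]]    # toy outgoing messages
shifted = [[v + 1000.0 for v in to_parents[0]], to_parents[1]]

b_original = belief(phi, [], to_parents, eps=0.5)
b_shifted = belief(phi, [], shifted, eps=0.5)
print(b_original)
print(b_shifted)
```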

Theorem 6 The block coordinate descent algorithm in Lemmas 4 and 5 monotonically decreases the primal program in Theorem 3, therefore the value of its objective is guaranteed to converge. Moreover, if $\epsilon > 0$, then the value of its objective is guaranteed to converge to the global minimum, and its sequence of beliefs is guaranteed to converge to the unique solution of the dual program.

Proof: Whenever $\epsilon > 0$, the dual objective is strictly concave in $b_{(x,y),r}(\hat{y}_r)$, subject to linear marginalization constraints and linear moment constraints. Hence the claimed properties are a direct consequence of Tseng and Bertsekas (1987) for this type of program. □

The convergence of the block coordinate descent depends on the step size $\eta$, which must be chosen so that it reduces the objective. This can be done by the Armijo rule, or by using the fact that the function is strongly convex (e.g., Tseng and Bertsekas (1987)) and its gradient is Lipschitz continuous (e.g., Nesterov (2004)). In practice, Theorem 6 describes how to measure the convergence of the algorithm, either by the primal objective, the dual objective or the beliefs.



5. Extensions: Entropies, Regularizations and the Penalty Method 

Learning with low dimensional extended log-loss consists of two conjugate dual functions, 
one fits the moments and the marginalization constraints and the other provides a selec- 
tion rule. Using these functions we are able to solve it efficiently with a message-passing 
algorithm. In this section we extend our framework while maintaining the computational 
efficiency of our message-passing algorithm. 

We fit the moments using the square function. Since this function is strictly convex, its conjugate dual is smooth, thus we are able to perform a gradient descent step to optimize $w$. The selection rule we apply consists of entropies over region labels. This selection rule is frequently referred to as an entropy approximation since it replaces the entropy function over all labels. Whenever the region graph is bipartite and without cycles, the Gibbs distribution can be described by its marginal probabilities $p(y_r) = \sum_{y \setminus y_r} p(y)$, as described in equation (5). Therefore the entropy can be equivalently described by a weighted sum of local entropies $H(p) = \sum_r (1 - |P(r)|) H(p(y_r))$, called the Bethe entropy. More generally, one can use a fractional entropy approximation

$$H(p) \approx \sum_r c_r H(p(y_r)). \qquad (6)$$
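The claim that the Bethe weights $c_r = 1 - |P(r)|$ reproduce the exact entropy on a cycle-free region graph can be verified on a small chain. The sketch below is an illustration under assumed toy numbers: a three-variable binary Markov chain whose regions are the two edges (no parents, weight 1) and the three nodes (weights 0, -1, 0, since the middle node has two parent edges).

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy of a dictionary mapping labels to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a joint distribution onto the variable indices in `keep`."""
    out = {}
    for labels, p in joint.items():
        key = tuple(labels[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Toy Markov chain y1 - y2 - y3 over binary labels (hypothetical numbers).
p1 = [0.3, 0.7]
t12 = [[0.9, 0.1], [0.2, 0.8]]
t23 = [[0.6, 0.4], [0.5, 0.5]]
joint = {
    (a, b, c): p1[a] * t12[a][b] * t23[b][c]
    for a, b, c in product(range(2), repeat=3)
}

# Region -> Bethe weight c_r = 1 - |P(r)|.
weights = {(0, 1): 1, (1, 2): 1, (0,): 0, (1,): -1, (2,): 0}

bethe = sum(c * entropy(marginal(joint, r)) for r, c in weights.items())
exact = entropy(joint)
print(bethe, exact)
```

On graphs with cycles the two quantities generally differ, which is where the fractional weights $c_r$ in equation (6) become a modeling choice.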

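The introduction of general functions for fitting the constraints, together with fractional entropy approximations as selection rules, to learn the parameters of structured predictors with low dimensional loss functions provides the following primal and dual programs:

Theorem 7 Consider the setting of Theorem 3. For $\epsilon, c_r \geq 0$ strong duality holds for the following primal and dual programs.

(Primal)
$$\min_{w, \lambda} \sum_{(x,y) \in \mathcal{S},\, r \in \mathcal{R}} \bar{\ell}_{r,\, \epsilon c_r\text{-}\log}(w, x, y_r) + \sum_k h_1(-w_k) + h_2(\lambda)$$

(Dual)
$$\max_{b_{(x,y),r} \in \Delta_{\mathcal{Y}_r}} \sum_{(x,y) \in \mathcal{S}} \sum_{r \in \mathcal{R}} \Big(\epsilon c_r H(b_{(x,y),r}) + \sum_{\hat{y}_r} b_{(x,y),r}(\hat{y}_r)\, \ell_r(y_r, \hat{y}_r)\Big)$$
$$- \sum_k h_1^*\Big(\sum_{(x,y) \in \mathcal{S}} \Big(\sum_{r \in \mathcal{R}_k} \sum_{\hat{y}_r \in \mathcal{Y}_r} b_{(x,y),r}(\hat{y}_r)\, \phi_{k,r}(x, \hat{y}_r) - \phi_k(x, y)\Big)\Big)$$
$$- \sum_{(x,y) \in \mathcal{S}} \sum_{r \in \mathcal{R}} \sum_{\hat{y}_r \in \mathcal{Y}_r} \sum_{p \in P(r)} h_2^*\Big(\sum_{\hat{y}_p \setminus \hat{y}_r} b_{(x,y),p}(\hat{y}_p) - b_{(x,y),r}(\hat{y}_r)\Big)$$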

Proof: The proof follows the same lines as Theorem 3, while the differences are in the functions $f_s(\cdot)$ and $h_t(\cdot)$. We use the triplets $s = ((x, y), r)$, and $f_s(b)$ is the log-partition function of $b_{(x,y),r}(\hat{y}_r; w/\epsilon c_r, \lambda/\epsilon c_r)$. For notational convenience we use the same regularization function for moment matching, i.e., $h_1(w)$, and for marginalization constraints, i.e., $h_2(\lambda)$. □

The above formulation describes a duality relation between the penalty functions $h_t^*(\cdot)$, for fitting the moments and marginalization constraints, and the regularization $h_t(\cdot)$. Thus, one can use different penalty functions to influence the properties of the primal program. For example, one can use known relations between the weight of the penalty function to deduce the appropriate (dual) weight of the regularization function (cf. Bertsekas et al. (2003), Section 5.5). Also, one can choose the penalty function to be strongly concave with Lipschitz continuous gradient, for which the primal block gradient descent has a linear convergence rate.

The introduction of fractional entropy approximations also provides the means to control the properties of the primal and dual programs. For example, for pairwise region graphs one can use the tree re-weighted coefficients $c_r$ to upper bound the true entropy function, and the dual program is concave whenever the marginalization constraints are satisfied (Wainwright et al. (2005)). At the optimum, these entropies translate to an extended log-loss that upper bounds the true log-loss. For general region graphs, one can obtain upper bounds with non-negative region weights $c_r \geq 0$ that cover the different nodes in the graphical model, namely $\sum_{r : i \in r} c_r \geq 1$ (Friedgut (2004); Madiman and Tetali (2010)). In this setting the dual program is everywhere concave, thus the entropy upper bounds translate, through strong duality, to upper bounds on the extended log-loss in the primal, which we aim at minimizing.

Using the fractional entropy approximations as our selection rule does not affect the computational properties of the programs, since its maximizing arguments, and hence the gradient of the extended log-loss, are beliefs which are (weighted) Gibbs distributions. The Gibbs distributions are computationally favorable since they have a log-linear form (e.g., equation (2)), thus they provide an analytical block coordinate update rule regardless of the weights $c_r$. Moreover, since the block coordinate descent iterates over points with vanishing gradients, it can also explain algorithms for optimizing weights with mixed signs, as happens with the Bethe entropy and the tree re-weighted entropy. Therefore we are able to extend the low dimensional learning framework while providing efficient message-passing algorithms:

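Theorem 8 Consider the primal and dual programs in Theorem 7 with smooth regularization $h_k(-w_k)$, $h_t(\lambda) = 0$, and the following algorithm:

$$\mu_{(x,y),p \to r}(\hat{y}_r) = \epsilon c_p \log \sum_{\hat{y}_p \setminus \hat{y}_r} \exp\Big(\big(\phi_p(\hat{y}_p) + \sum_{c \in C(p) \setminus r} \lambda_{(x,y),c \to p}(\hat{y}_c) - \sum_{p' \in P(p)} \lambda_{(x,y),p \to p'}(\hat{y}_p)\big) / \epsilon c_p\Big)$$

$$\lambda_{(x,y),r \to p}(\hat{y}_r) = \frac{c_p}{c_r + \sum_{p' \in P(r)} c_{p'}}\Big(\phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) + \sum_{p' \in P(r)} \mu_{(x,y),p' \to r}(\hat{y}_r)\Big) - \mu_{(x,y),p \to r}(\hat{y}_r)$$

$$w_k \leftarrow w_k - \eta\Big(\sum_{(x,y) \in \mathcal{S}} \sum_{r \in \mathcal{R}_k} \Big(\sum_{\hat{y}_r} b_{(x,y),r}(\hat{y}_r; w/\epsilon c_r, \lambda/\epsilon c_r)\, \phi_{k,r}(x, \hat{y}_r) - \phi_{k,r}(x, y_r)\Big) + \frac{\partial}{\partial w_k} h_k(-w_k)\Big)$$

Whenever $\epsilon, c_r \geq 0$ these update rules monotonically decrease the primal objective, thus they are guaranteed to converge. Whenever $\epsilon, c_r > 0$ the beliefs generated by this algorithm converge to a dual optimal solution and the primal and dual objectives converge to their optimal values. Moreover, if the $\lambda$ converge, then their limit point is a primal optimal solution. Whenever $\epsilon, c_r \neq 0$ the algorithm is not guaranteed to converge, but whenever it converges it recovers stationary points for both programs.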

Proof: The proof follows the ones in Lemmas 4 and 5 with some modifications. The primal program is unconstrained, therefore we describe the points for which the gradient vanishes. The gradient with respect to $\lambda_{(x,y),r \to p}(\hat{y}_r)$ relates to the disagreements of the marginal beliefs $\sum_{\hat{y}_p \setminus \hat{y}_r} b_{(x,y),p}(\hat{y}_p; w/\epsilon c_p, \lambda/\epsilon c_p) - b_{(x,y),r}(\hat{y}_r; w/\epsilon c_r, \lambda/\epsilon c_r)$. Thus the gradient vanishes when

$$\frac{\mu_{(x,y),p \to r}(\hat{y}_r) + \lambda_{(x,y),r \to p}(\hat{y}_r)}{c_p} = \frac{\phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) - \sum_{p' \in P(r)} \lambda_{(x,y),r \to p'}(\hat{y}_r)}{c_r}.$$

Multiplying both sides by $c_r c_p$ and summing both sides with respect to $p \in P(r)$ we obtain

$$\sum_{p' \in P(r)} \lambda_{(x,y),r \to p'}(\hat{y}_r) = \frac{\sum_{p' \in P(r)} c_{p'}}{c_r + \sum_{p' \in P(r)} c_{p'}}\Big(\phi_r(\hat{y}_r) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) + \sum_{p' \in P(r)} \mu_{(x,y),p' \to r}(\hat{y}_r)\Big) - \sum_{p' \in P(r)} \mu_{(x,y),p' \to r}(\hat{y}_r).$$

Plugging it into the above equation results in the desired update rule, i.e., $\lambda_{(x,y),r \to p}(\hat{y}_r)$ for which the partial derivatives vanish. The convergence for $\epsilon, c_r \geq 0$ is guaranteed since the primal program is lower bounded by its dual. The optimality results for $\epsilon, c_r > 0$ are achieved by applying Tseng and Bertsekas (1987). Whenever $\epsilon, c_r \neq 0$, if the algorithm converges the primal gradient vanishes, thus it recovers a primal stationary point. Considering the Lagrangian of the dual, given the Lagrange multipliers $\lambda, w$ their corresponding beliefs $b_{(x,y),r}(\hat{y}_r; w/\epsilon c_r, \lambda/\epsilon c_r)$ satisfy the marginalization constraints, therefore we also recover a stationary point for the dual. □
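The weighted update can be sanity-checked numerically: after one update of all messages leaving a region, the scaled numerators $(\mu + \lambda)/c_p$ of every parent should equal the scaled numerator of the region itself. The sketch below is a hedged illustration, not the authors' code. It assumes all messages are indexed by the labels of region $r$, that the incoming child messages are pre-summed into `child_sum`, and that the function names and numbers are hypothetical.

```python
def lambda_updates(phi_r, child_sum, mu, c_r, c_parents):
    """lam_p[y] = c_p / (c_r + sum of parent weights) * (phi_r + child_sum + sum of mu) - mu_p."""
    total_c = c_r + sum(c_parents)
    labels = range(len(phi_r))
    pooled = [phi_r[y] + child_sum[y] + sum(m[y] for m in mu) for y in labels]
    return [
        [c_p / total_c * pooled[y] - mu_p[y] for y in labels]
        for mu_p, c_p in zip(mu, c_parents)
    ]

def disagreement(phi_r, child_sum, mu, lam, c_r, c_parents):
    """Largest gap between the parent-side and region-side scaled numerators."""
    worst = 0.0
    for y in range(len(phi_r)):
        region_side = (phi_r[y] + child_sum[y] - sum(l[y] for l in lam)) / c_r
        for mu_p, lam_p, c_p in zip(mu, lam, c_parents):
            parent_side = (mu_p[y] + lam_p[y]) / c_p
            worst = max(worst, abs(parent_side - region_side))
    return worst

phi_r = [0.3, -0.2]                      # toy region scores
child_sum = [0.1, 0.0]                   # toy sum of messages from children
mu = [[0.5, -0.5], [1.0, 0.2]]           # toy messages from two parents
c_r, c_parents = 0.5, [1.0, 2.0]         # toy region weights

lam = lambda_updates(phi_r, child_sum, mu, c_r, c_parents)
print(disagreement(phi_r, child_sum, mu, lam, c_r, c_parents))
```

With `c_r = 1` and all parent weights equal to 1, the coefficient becomes $1/(1 + |P(r)|)$, which is the unweighted update of Lemma 4.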

The above theorem generalizes the message-passing algorithm for learning the parameters of structured predictors using low dimensional extended log-loss in Fig. 1, as it introduces the region weights $c_r$. This provides a way to explain different heuristics for learning structured predictors in graphical models. For example, setting the Bethe coefficients $c_r = 1 - |P(r)|$ amounts to minimizing a non-convex program using belief propagation to approximate the marginal probabilities. Whenever the solver converges we are able to match the moments and expect the resulting weights to be good in practice. This can explain the success of belief propagation when applied as a heuristic to estimate the marginal probabilities, since when it converges its beliefs agree on their marginal probabilities. However, since non-convex programs are harder to optimize, the algorithm might not result in beliefs that fit the moments and agree on their marginal probabilities. This in turn may explain the failures of belief propagation heuristics, as belief propagation is not guaranteed to converge and its resulting beliefs do not necessarily agree on their marginal probabilities.

Figure 2: Gaussian and bimodal noise: Comparison of our approach to loopy belief propagation and mean field approximations when optimizing using BFGS, SGD and SMD. Note that our approach significantly outperforms all the baselines. MF-SMD did not work for Bimodal noise.



6. Experimental Evaluation

In this section we evaluate our approach on a wide range of computer vision problems including de-noising, stereo estimation, semantic segmentation, shape reconstruction and 3D indoor scene understanding. Our approach enables us to learn a large number of parameters efficiently and results in state-of-the-art performance in all these tasks. A more detailed version of these results can be found in Salzmann and Urtasun (2012); Yamaguchi et al. (2012); Schwing et al. (2012); Yao et al. (2012).



6.1 Image Denoising 

We performed experiments on 2D grids since they are widely used to represent images, and have many cycles. We first investigate the role of $\epsilon$ in the accuracy and running time of our algorithm, described in Fig. 1. We used a 10 × 10 binary image and randomly generated 10 corrupted samples, flipping every bit with probability 0.2. We trained the model using $\epsilon \in \{1, 0.5, 0.01, 0\}$, ranging the low-dimensional extended log-loss from $\epsilon = 1$ (low dimensional CRFs) to $\epsilon = 0$ (low dimensional structured SVM) and its smooth version ($\epsilon = 0.01$). The runtimes are 323, 324, 326, 294 seconds for $\epsilon = 1, 0.5, 0.01, 0$ respectively. As $\epsilon$ gets smaller the runtime slightly increases, but it decreases for $\epsilon = 0$ since the $\ell_\infty$ norm is efficiently computed using the max function. However, for $\epsilon = 0$ it is hard to determine the optimality as the max-function is non-smooth, thus a dual solution is not uniquely recovered. When the approximated structured SVM converged, the gap between the primal objective and dual objective was 1.3, while the dual beliefs were recovered according to the subgradient, i.e., the maximal arguments. In contrast, for $\epsilon > 0$ the primal-dual gap was
10~^, while the dual beliefs were uniquely recovered using the gradient.
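The training data for this experiment can be sketched as below: a 10 × 10 binary image and 10 corrupted copies in which every bit is flipped independently with probability 0.2. The checkerboard pattern, the seed, and the helper names `corrupt` and `pixel_errors` are assumptions for illustration; the paper does not specify the image content.

```python
import random

def corrupt(image, flip_prob, rng):
    """Flip each binary pixel independently with probability flip_prob."""
    return [[1 - px if rng.random() < flip_prob else px for px in row] for row in image]

def pixel_errors(a, b):
    """Number of positions where two images of the same size disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [[1 if (r // 5 + c // 5) % 2 == 0 else 0 for c in range(10)] for r in range(10)]
samples = [corrupt(clean, 0.2, rng) for _ in range(10)]

flips = [pixel_errors(clean, s) for s in samples]
print(flips)  # flipped pixels per sample, out of 100
```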

We generated test images in a similar fashion. When using the same $\epsilon$ for training and testing we obtained 2 misclassifications for $\epsilon > 0$ and 109 for $\epsilon = 0$. We conjecture that this comes from the existence of multiple maximal arguments in the primal objective when $\epsilon = 0$, or equivalently from its non-smooth corners. We also evaluated the quality of the solution using different values of $\epsilon$ for training and inference, following Wainwright (2006). When predicting with a smaller $\epsilon$ than the one used for learning, the results are marginally worse than when predicting with the same $\epsilon$. However, when predicting with a larger $\epsilon$, the results get significantly worse, e.g., learning with $\epsilon = 0.01$ and predicting with $\epsilon = 1$ results in 10 errors, and only 2 when $\epsilon = 0.01$.

The main advantage of our algorithm is that it can efficiently learn many parameters in a graphical model. We now compare, in a similarly generated dataset of size 5 × 5, a model learned with different parameters for every edge and vertex (300 parameters) and a model learned with parameters shared among the vertices and edges (2 parameters for edges and 2 for vertices) used by Kumar and Hebert (2003). Using a large number of parameters increases performance: sharing parameters resulted in 16 misclassifications, while optimizing over the 300 parameters resulted in 2 errors. We note that our algorithm avoids overfitting in this case.



We now compare our algorithm in Fig. 1 to standard CRF solvers that use different approaches to compute the gradient. We use the binary image dataset of Kumar and Hebert (2003) that consists of 4 different 64 × 64 base images. Each base image was corrupted 50 times with each type of noise. Following Vishwanathan et al. (2006), we trained different models to denoise each individual image, using 40 examples for training and 10 for test. We compare our approach to the result of approximating the conditional likelihood using loopy belief propagation (LBP) and mean field approximation (MF). For each of these approximations, we use stochastic gradient descent (SGD), stochastic meta-descent (SMD) and BFGS to learn the parameters. We do not report pseudolikelihood (PL) results since it did not work. Note that the same behavior of PL was noticed by Vishwanathan et al. (2006). To reduce the computational complexity and improve the chances of convergence, Kumar and Hebert (2003); Vishwanathan et al. (2006) forced their parameters to be shared across all nodes, such that $\theta_i$ is the same for every node $i$ and $\theta_{ij}$ is the same for every $i$ and $j \in N(i)$. In contrast, we can exploit the full flexibility of the graph and learn more than 10,000 parameters. Note that this is computationally prohibitive with the baselines. For the local features we simply use the pixel values, and for the node potentials we use an Ising model with only bias features, [1, −1; −1, 1]. For all experiments we use $\epsilon = 1$. For the baselines, we use the code, features and optimal parameters of Vishwanathan et al. (2006).

Figure 3: Denoising results: Gaussian (left) and Bimodal (right) noise.

Figure 4: Convergence. Primal and dual training errors when $I_1$ is corrupted with Gaussian and Bimodal noise. Our algorithm is able to converge in a few iterations.

Under the first noise model, each pixel was corrupted via i.i.d. Gaussian noise with mean 0 and standard deviation of 0.3. Fig. 2 depicts test error (in %) for the different base images (i.e., $I_1, \ldots, I_4$). Note that our approach considerably outperforms the loopy belief propagation and mean field approximations for all optimization criteria (BFGS, SGD, SMD). For example, for the first base image the error of our approach is 0.0488%, which is equivalent to a 2 pixel error on average. In contrast, the best baseline gets 112 pixels wrong on average. Fig. 3 (left) depicts test examples as well as our denoising results. Note that our approach is able to cope with large amounts of noise.
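The relation between the reported error percentages and pixel counts on a 64 × 64 image, together with the Gaussian corruption, can be sketched as follows. The zero mean, the all-zero toy image, the seed, and the helper names are assumptions made for this illustration.

```python
import random

def add_gaussian_noise(image, std, rng, mean=0.0):
    """Corrupt every pixel with i.i.d. Gaussian noise (zero mean is assumed here)."""
    return [[px + rng.gauss(mean, std) for px in row] for row in image]

def error_percent(wrong_pixels, height=64, width=64):
    """Convert a count of misclassified pixels into a percentage of the image."""
    return 100.0 * wrong_pixels / (height * width)

rng = random.Random(0)
clean = [[0.0] * 64 for _ in range(64)]           # toy all-zero base image
noisy = add_gaussian_noise(clean, 0.3, rng)
noise_mean = sum(px for row in noisy for px in row) / (64 * 64)

print(round(error_percent(2), 4))    # 0.0488, i.e. 2 wrong pixels out of 4096
print(round(error_percent(112), 4))  # 2.7344, i.e. 112 wrong pixels out of 4096
```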

Under the second noise model, each pixel was corrupted with an independent mixture of Gaussians. For each class, a mixture of 2 Gaussians with equal mixing weights was used, yielding the Bimodal noise. The mixture model parameters were (0.08, 0.03) and (0.46, 0.03) for the first class and (0.55, 0.02) and (0.42, 0.10) for the second class, with (a, b) a Gaussian with mean a and standard deviation b. Fig. 2 depicts test error (in %) for the different base images. As before, our approach outperforms all the baselines. We do not report MF-SMD results since it did not work. Denoised images are shown in Fig. 3 (right). We now show how our algorithm converges in a few iterations. Fig. 4 depicts the primal and dual training errors as a function of the number of iterations. Note that our algorithm converges, and the dual and primal values are very tight after a few iterations.

6.2 Stereo Estimation



The problem of stereo estimation consists of two images of a scene, for which we wish to calculate the depth for each pixel in these images. Assuming that the cameras are calibrated and the images are rectified, the problem can be reduced for each pixel to a 1-D search along the corresponding epipolar line. Over the past few decades we have witnessed a great improvement in performance of stereo algorithms. Most modern approaches frame the problem as inference on a graphical model. Most methods Birchfield and Tomasi (1999); Hong and Chen (2004); Bleyer and Gelautz (2005); Yang et al. (2008); Trinh and McAllester (2009) assume a fixed set of superpixels on a reference image, say the left image, and model the surface under each superpixel as a slanted plane. The graphical model typically has a robust data term scoring the assigned plane in terms of a matching score induced by the plane on the pixels contained in the superpixel. This data term often incorporates an explicit treatment of occlusion, i.e., pixels in one image that have no corresponding pixel in the other image Zitnick and Kanade (2000); Kolmogorov and Zabih (2002); Deng et al. (2005); Bleyer et al. (2010). Slanted-plane models also typically include a robust smoothness term expressing the belief that the planes assigned to adjacent superpixels should be similar. Despite recent advances in learning graphical models, most approaches hand-tune their parameters.



In recent work Yamaguchi et al. (2012), we have proposed an approach to stereo estimation that is computationally efficient in both learning and inference. We incorporate a better model of occlusion than existing approaches by explicitly modeling occlusion boundaries between adjacent superpixels. This allows us to incorporate potentials that reason about boundary ownership as well as whether junctions are physically possible. We now briefly discuss the graphical model as well as the potentials we employ.

We represent the stereo estimation problem as one of inference in a hybrid Markov random field that contains a mixture of discrete and continuous random variables. The continuous random variables represent, for each segment, the disparities of all pixels contained in that segment in the form of a 3D slanted plane. The discrete random variables indicate, for each pair of neighboring segments, whether they are co-planar, they form a hinge, or there is a depth discontinuity (indicating which plane is in front of which). Let $o_{i,j} \in \{co, hi, lo, ro\}$ be a discrete random variable representing whether two neighboring planes are coplanar, form a hinge or an occlusion boundary. Here, $lo$ implies that plane $i$ occludes plane $j$, and $ro$ represents that plane $j$ occludes plane $i$. We define our hybrid conditional random field as follows

where $y$ represents the set of all 3D slanted planes, $o$ the set of all discrete random variables, and $\psi$ encode potential functions over sets of continuous, discrete or a mixture of both types of variables. Note that $y$ contains three random variables for every segment in the image, and there is a random variable $o_{i,j}$ for each pair of neighboring segments.
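To make the "three random variables per segment" concrete, here is a minimal sketch of how a slanted plane could assign a disparity to every pixel of a segment. The parametrization $d(u, v) = a u + b v + c$ and the names `plane_disparity` and `segment_disparities` are assumptions chosen for illustration; the source does not spell out the parametrization it uses.

```python
def plane_disparity(plane, u, v):
    """Disparity at pixel (u, v) under a plane (a, b, c): d = a*u + b*v + c (assumed form)."""
    a, b, c = plane
    return a * u + b * v + c

def segment_disparities(plane, pixels):
    """Disparity of every pixel in a segment, all governed by the same three parameters."""
    return {(u, v): plane_disparity(plane, u, v) for u, v in pixels}

segment = [(4, 7), (5, 7), (6, 7)]   # toy pixel coordinates of one segment
plane = (0.5, 0.0, 10.0)             # toy plane: disparity grows along u
print(segment_disparities(plane, segment))
```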



We now briefly describe the potentials we employ, and refer the reader to Yamaguchi et al. (2012) for more details. We utilize a truncated quadratic disparity potential, $\phi^{TP}_i(y_i)$, which encodes that the plane should agree with the results of the matching along the epipolar lines. We additionally incorporate 3-way boundary potentials $\phi^{bdy1}_{ij}(o_{ij}, y_i, y_j)$ linking our discrete and continuous variables. In particular, these potentials express the fact that when two neighboring planes are hinge or coplanar they should agree on the boundary, and when a segment occludes another segment, the boundary should be explained by the occluder. We impose a regularization on the type of occlusion boundary, where we prefer simpler explanations (i.e., coplanar is preferable to hinge, which is more desirable than occlusion). This is encoded in $\phi^{bdy2}_{ij}(o_{ij}, y_i, y_j)$. We introduce a potential $\phi^{occ}_{ij}(y_{front}, y_{back})$ which ensures that the discrete occlusion labels match well the disparity observations, as well as an additional potential $\phi^{neg}_i(y_i)$ which penalizes negative disparities. Following work on occlusion boundary reasoning Malik (1987); Hoiem et al. (2007), we utilize higher order potentials to encode whether junctions of three and four planes are possible. This is encoded in $\phi^{jct}_{ijk}(o_{ij}, o_{jk}, o_{ik})$ and $\phi^{crs}_{pqrs}(o_{pq}, o_{qr}, o_{rs}, o_{ps})$ respectively. Finally, we employ a simple color potential to reason about segmentation, which is defined in terms of the $\chi$-squared distance between color histograms of neighboring segments. This potential, $\phi^{col}_{ij}(o_{ij})$, encodes the fact that we expect segments which are coplanar to have similar color statistics (i.e., histograms), while the entropy of this distribution is higher when the planes form an occlusion boundary or a hinge. Fig. 5 (left) illustrates the graphical model. Thus our hybrid graphical model is defined in terms of the following energy function

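$$E(y, o) = \sum_{i=1}^{|y|} w^{TP} \phi^{TP}_i(y_i) + \sum_{(i,j) \in \mathcal{E}_{bdy}} w^{bdy1} \phi^{bdy1}_{ij}(o_{ij}, y_i, y_j) + \sum_{(i,j) \in \mathcal{E}_{bdy}} w^{occ} \phi^{occ}_{ij}(y_{front}, y_{back})$$
$$+ \sum_{(i,j) \in \mathcal{E}_{bdy}} w^{bdy2} \phi^{bdy2}_{ij}(o_{ij}, y_i, y_j) + \sum_{i=1}^{|y|} w^{neg} \phi^{neg}_i(y_i) + \sum_{(i,j) \in \mathcal{E}_{bdy}} w^{col} \phi^{col}_{ij}(o_{ij})$$
$$+ \sum_{(i,j,k) \in \mathcal{E}_{jct}} w^{jct} \phi^{jct}_{ijk}(o_{ij}, o_{jk}, o_{ik}) + \sum_{(p,q,r,s) \in \mathcal{E}_{crs}} w^{crs} \phi^{crs}_{pqrs}(o_{pq}, o_{qr}, o_{rs}, o_{ps})$$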



We learn the weights $w$ of these potentials with our approach, and set $\epsilon = 1$ and $C$ to be equal to the number of training examples. We also employ particle convex belief propagation (PCBP) Peng et al. (2011) for inference. PCBP is an iterative algorithm that works as follows: For each random variable, particles are sampled around the current solution. These samples act as labels in a discretized graphical model which is solved to convergence using convex belief propagation Hazan and Shashua (2010). The current solution is then updated with the MAP estimate obtained on the discretized graphical model. This process is repeated for a fixed number of iterations. In our implementation, we use the distributed message passing algorithm of Schwing et al. (2011) to solve the discretized graphical model at each iteration.
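The sample, discretize, solve, update loop can be sketched on a toy chain of continuous variables. This is only an illustration of the loop structure: exact dynamic programming on a chain stands in for convex belief propagation, the shrinking sampling width is an added assumption, and every name and number below is hypothetical rather than taken from Peng et al. (2011).

```python
import random

def chain_map(particles, unary, pairwise):
    """Exact MAP on a chain over per-variable particle sets (stand-in for convex BP)."""
    cost = [unary(0, s) for s in particles[0]]
    back = []
    for i in range(1, len(particles)):
        new_cost, ptr = [], []
        for s in particles[i]:
            options = [cost[j] + pairwise(t, s) for j, t in enumerate(particles[i - 1])]
            j = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[j] + unary(i, s))
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    j = min(range(len(cost)), key=cost.__getitem__)
    picks = [j]
    for ptr in reversed(back):  # follow back-pointers from the last variable
        j = ptr[j]
        picks.append(j)
    picks.reverse()
    return [particles[i][k] for i, k in enumerate(picks)]

def particle_iterations(init, unary, pairwise, rng, iters=20, n_particles=8, sigma=1.0):
    current = list(init)
    for _ in range(iters):
        # Keep the current value as a particle so the energy can never increase.
        particles = [[x] + [rng.gauss(x, sigma) for _ in range(n_particles)] for x in current]
        current = chain_map(particles, unary, pairwise)
        sigma *= 0.8  # assumed schedule: narrow the sampling over time
    return current

def energy(values, unary, pairwise):
    return sum(unary(i, v) for i, v in enumerate(values)) + sum(
        pairwise(a, b) for a, b in zip(values, values[1:])
    )

targets = [1.0, 2.0, 3.0]                       # toy observations
unary = lambda i, v: (v - targets[i]) ** 2      # toy data term
pairwise = lambda a, b: 0.1 * (a - b) ** 2      # toy smoothness term
init = [0.0, 0.0, 0.0]
result = particle_iterations(init, unary, pairwise, random.Random(0))
print(result, energy(result, unary, pairwise))
```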



We perform experiments on the challenging KITTI dataset Geiger et al. (2012), which is the only real-world stereo dataset with accurate ground truth. It is composed of 194 training and 195 test high-resolution images (1237.1 × 374.1 pixels) captured from an autonomous driving platform driving around in an urban environment. The ground truth is generated by means of a Velodyne sensor which is calibrated with the stereo pair. This results in semi-dense ground truth covering approximately 30% of the pixels. We employ 20 images for training, and utilize the remaining 174 images for validation purposes.

We employ two different metrics. The first one measures the average number of non-occluded pixels whose error is bigger than a fixed threshold. To test the extrapolation capabilities of the different approaches, the second metric computes the same quantity, but including the occluded pixels as well. We employ these metrics as our loss. Table 1 depicts results of our approach and the baselines in terms of the two metrics. Note that our approach significantly outperforms all the baselines in all settings (i.e., thresholds bigger than 2, 3, 4 and 5 pixels). Fig. 6 depicts an illustrative set of KITTI examples. Despite the challenges, our approach does a good job at estimating disparities.

Figure 5: Graphical models for (a) stereo (b) recognition.

                                       > 2 pixels          > 3 pixels          > 4 pixels          > 5 pixels
Method                               Non-Occ    All      Non-Occ    All      Non-Occ    All      Non-Occ    All
GC+occ Kolmogorov and Zabih (2001)   39.76 %  40.97 %    33.50 %  34.74 %    29.86 %  31.10 %    27.39 %  28.61 %
OCV-BM Bradski (2000)                27.59 %  28.97 %    25.39 %  26.72 %    24.06 %  25.32 %    22.94 %  24.14 %
CostFilter Rhemann et al. (2011)     25.85 %  27.05 %    19.96 %  21.05 %    17.12 %  18.10 %    15.51 %  16.40 %
GCS Cech and Sara (2007)             18.99 %  20.30 %    13.37 %  14.54 %    10.40 %  11.44 %     8.63 %   9.55 %
GCSF Cech et al. (2011)              20.75 %  22.69 %    13.02 %  14.77 %     9.48 %  11.02 %     7.48 %   8.84 %
SDM Kostkova (2003)                  15.29 %  16.65 %    10.98 %  12.19 %     8.81 %   9.87 %     7.44 %   8.39 %
ELAS Geiger et al. (2010)            10.95 %  12.82 %     8.24 %   9.95 %     6.72 %   8.22 %     5.64 %   6.94 %
OCV-SGBM Hirschmueller (2008)        10.58 %  12.20 %     7.64 %   9.13 %     6.04 %   7.40 %     5.04 %   6.25 %
ITGV Ranftl et al. (2012)             8.86 %  10.20 %     6.31 %   7.40 %     5.06 %   5.97 %     4.26 %   5.01 %
Ours                                  6.25 %   7.78 %     4.13 %   5.45 %     3.18 %   4.32 %     2.66 %   3.66 %

Table 1: Comparison with the state-of-the-art on the test set of KITTI Geiger et al. (2012).
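The two metrics can be sketched as a single function that reports the percentage of ground-truth pixels whose disparity error exceeds a threshold, with or without the occluded pixels. The dictionary-based data layout, the function name `outlier_rate`, and the toy values are assumptions; this is not the official KITTI evaluation code.

```python
def outlier_rate(pred, truth, threshold, occluded=None, include_occluded=False):
    """Percentage of ground-truth pixels whose absolute disparity error exceeds threshold."""
    bad = total = 0
    for key, d_true in truth.items():
        if not include_occluded and occluded and key in occluded:
            continue  # first metric: skip occluded pixels
        total += 1
        if abs(pred[key] - d_true) > threshold:
            bad += 1
    return 100.0 * bad / total if total else 0.0

# Toy semi-dense ground truth: only some pixels carry a disparity value.
truth = {(0, 0): 10.0, (0, 1): 20.0, (1, 0): 30.0, (1, 1): 40.0}
pred = {(0, 0): 10.5, (0, 1): 23.5, (1, 0): 30.0, (1, 1): 46.0}
occluded = {(1, 1)}

non_occ = outlier_rate(pred, truth, 3.0, occluded)
all_px = outlier_rate(pred, truth, 3.0, occluded, include_occluded=True)
print(non_occ, all_px)
```

In this toy case one of the three non-occluded pixels and two of all four pixels exceed the 3 pixel threshold, mirroring how the "All" columns in Table 1 are never lower than the "Non-Occ" columns.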



6.3 Semantic Segmentation 



While there has been significant progress in solving tasks such as image labeling Ladicky et al. (2010a), object detection Felzenszwalb et al. (2010) and scene classification Xiao et al. (2010), existing approaches could benefit from solving these problems jointly Heitz et al. (2008). For example, segmentation should be easier if we know where the object of interest is. Similarly, if we know the type of the scene, we can narrow down the classes we are expected to see, e.g., if we are looking at the sea, we are more likely to see a boat than a cow. Conversely, if we know which semantic regions (e.g., sky, road) and which objects are present in the scene, we can more accurately infer the scene type. Holistic scene understanding aims at recovering multiple related aspects of a scene so as to provide a deeper understanding of the scene as a whole.

Figure 6: KITTI examples. (Left) Original. (Middle) Disparity. (Right) Disparity errors.



In recent work, Yao et al. (2012), we have proposed an approach to holistic scene understanding that simultaneously reasons about regions, location, class and spatial extent of objects, as well as the type of scene. We frame the holistic problem as a structured prediction problem in a graphical model defined over hierarchies of regions of different sizes, as well as auxiliary variables encoding the scene type, the presence of a given class in the scene, and the correctness of the bounding boxes output by an object detector. For objects with well-defined shape (e.g., cow, car), we additionally incorporate a shape prior that takes the form of a soft mask learned from training examples. Unlike existing approaches that reason at the (super-)pixel level, we employ Arbelaez et al. (2011) to obtain (typically large) regions which respect boundaries well. This enables us to represent the problem using only a small number of variables. Learning and inference are efficient in our model as the auxiliary variables we utilize allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with a small number of states (at most the number of classes).

We now briefly describe the graphical model as well as the potentials employed. We
refer the reader to Fig. 5 (right) for an overview of our model, and to Yao et al. (2012)
for more details and results. Let $x_i \in \{1, \dots, C\}$ be a random variable representing the
class label of the i-th segment in the lower level of the hierarchy, while $y_j \in \{1, \dots, C\}$ is
a random variable associated with the class label of the j-th segment of the second level of
the hierarchy. Following recent approaches Ladicky et al. (2010a); Lee et al. (2010b), we
represent the detection problem with a set of candidate bounding boxes. Let $b_i \in \{0, 1\}$
be a binary random variable associated with a candidate detection, taking value 0 when
the detection is a false detection. We use the detector of Felzenszwalb et al. (2010) to
generate candidate detections, which provides us with an object class, a score, the location
and aspect ratio of the bounding box, as well as the root mixture component ID that has
generated the detection. The latter gives us information about the expected shape of the
object. Let $z_k \in \{0, 1\}$ be a random variable which takes value 1 if class k is present in the
image, and let $s$ be a discrete random variable representing the scene type.
We define our holistic conditional random field as

$$p(a) = p(x, y, z, b, s) = \frac{1}{Z} \prod_{\text{type}} \prod_{r} \psi_r^{\text{type}}(a_r) \qquad (8)$$

where a = (x, y, z, b, s) represents the set of all segmentation random variables, x and y,
the set of C binary random variables z representing the presence of the different classes in
the scene, the set of all candidate detections b, and the scene type s, and $\psi_r^{\text{type}}$ encodes
potential functions over sets of variables. Note that the variables in a region r can be of the
same task (e.g., two segments) or different tasks (e.g., detection and segmentation).

We compute the unary potential for each region at the segment level, $\phi^{xseg}_i(x_i)$, and at the
super-segment level, $\phi^{yseg}_j(y_j)$, by averaging the TextonBoost Ladicky et al. (2010a) pixel
potentials inside each region. We use the $P^n$ potentials of Kohli et al. (2009), $\phi^{xy}_{ij}(x_i, y_j)$,
to encourage segments and super-segments to agree on their class labels. Additionally, unary
potentials $\phi^{det}_i(b_i)$ represent the score of the detector for that hypothesis squashed by a
sigmoid. We train a classifier for each scene type and represent its score in $\phi^{scene}(s)$. We
additionally incorporate the shape prior by placing a mask representing the typical shape of
the training examples that fell in that DPM component, and encouraging the segments inside
the bounding box to take the same label as the detector, with strength proportional to the
mask value on that segment. This is encoded in $\phi^{shape}_{ij}(b_i, x_j)$. We also incorporate statistics
of class occurrences and co-occurrences as unary and pairwise potentials $\phi^{z}_i(z_i)$ and
$\phi^{zz}_{ij}(z_i, z_j)$ respectively. $\phi^{zy}_{ij}(z_i, y_j)$ ensures that the classes that are inferred to be present
in the scene are compatible with the classes that are chosen at the segment level, while
$\phi^{bz}_{ij}(b_i, z_j)$ ensures that when a bounding box is on, its class is also present in the scene.
Finally, $\phi^{sz}_j(s, z_j)$ encodes statistics of class occurrences for each scene type. The energy of
the holistic graphical model is then defined as

$$\begin{aligned}
E(x, y, z, b, s) ={}& w_{scene}\, \phi^{scene}(s)
 + w_{xseg} \sum_{i=1}^{|x|} \phi^{xseg}_i(x_i)
 + w_{yseg} \sum_{j=1}^{|y|} \phi^{yseg}_j(y_j)
 + w_{xy} \sum_{i,j} \phi^{xy}_{ij}(x_i, y_j) \\
&+ w_{det} \sum_{i=1}^{|b|} \phi^{det}_i(b_i)
 + w_{shape} \sum_{i,j} \phi^{shape}_{ij}(b_i, x_j)
 + w_{z} \sum_{i=1}^{|z|} \phi^{z}_i(z_i)
 + w_{zz} \sum_{i,j} \phi^{zz}_{ij}(z_i, z_j) \\
&+ w_{bz} \sum_{i,j} \phi^{bz}_{ij}(b_i, z_j)
 + w_{sz} \sum_{j=1}^{|z|} \phi^{sz}_j(s, z_j)
 + w_{zy} \sum_{i,j} \phi^{zy}_{ij}(z_i, y_j)
\end{aligned}$$

We employ our approach to learn the weights with $\epsilon = 1$ and C = 0.02. To deal with
our holistic setting, we employ a holistic loss which takes into account all tasks. We define
it to be a weighted sum of losses, each one designed for a particular task, e.g., detection,
segmentation. In order to do efficient learning, it is important that the losses decompose
as a sum of functions on small subsets of variables. Here, we define loss functions which
decompose into unary terms. In particular, we define the segmentation loss at each level
of the hierarchy to be the percentage of wrongly predicted pixels. This decomposes as a sum
of unary terms (one for each segment). We utilize a 0-1 loss for the variables encoding
the classes that are present in the scene, which also decomposes as the sum of unary 0-1
losses on each $z_i$. We define a 0-1 loss over the scene type, and a PASCAL loss over the
detections which decomposes as the sum of losses for each detection.
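The decomposition of the holistic loss into unary terms can be illustrated with a minimal Python sketch. All names here (`pixel_counts`, `task_weights`, the dictionary keys) are hypothetical and chosen for illustration only, and the per-detection term is a plain 0-1 stand-in rather than the PASCAL loss used in the experiments.

```python
def segmentation_loss(pred_labels, pixel_counts):
    """Fraction of wrongly labeled pixels as a sum of per-segment (unary) terms.

    pixel_counts[i][c] is the number of ground-truth pixels of class c inside
    segment i (a hypothetical input format).
    """
    total = float(sum(sum(counts.values()) for counts in pixel_counts))
    loss = 0.0
    for label, counts in zip(pred_labels, pixel_counts):
        size = sum(counts.values())
        # Unary term of this segment: pixels that disagree with its label.
        loss += (size - counts.get(label, 0)) / total
    return loss


def zero_one_sum(pred, true):
    """Sum of unary 0-1 losses, one per variable."""
    return float(sum(1 for p, t in zip(pred, true) if p != t))


def holistic_loss(pred, true, pixel_counts, task_weights):
    """Weighted sum of task losses; every task loss is a sum of unary terms."""
    terms = {
        "x": segmentation_loss(pred["x"], pixel_counts["x"]),
        "y": segmentation_loss(pred["y"], pixel_counts["y"]),
        "z": zero_one_sum(pred["z"], true["z"]),
        "s": zero_one_sum([pred["s"]], [true["s"]]),
        # Stand-in: a 0-1 loss per detection instead of the PASCAL loss.
        "b": zero_one_sum(pred["b"], true["b"]),
    }
    return sum(task_weights[name] * value for name, value in terms.items())
```

Because every term depends on a single variable, the loss can be folded into the unary potentials during loss-augmented inference without increasing the order of the model.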

We test our approach on the task of semantic segmentation on the MSRC-21 dataset
Shotton et al. (2008). We employ Arbelaez et al. (2011) to obtain regions which respect
boundaries well, and set the watershed threshold to be 0.08 and 0.16 for the two layers in
the hierarchy. To create the unary potentials for the scenes, we use a standard bag-of-words
spatial pyramid with 1, 2 and 4 levels over a sparse coding dictionary of size 1024 on SIFT
features, color SIFT, RGB histograms and color moment invariants, and train a linear
one-vs-all SVM classifier. We use the detector of Felzenszwalb et al. (2010) to generate candidate
detections. For each detector we lowered the threshold to produce over-detections. We
follow Felzenszwalb et al.'s entry in PASCAL'09 to compute the soft shape masks. For
each class we ran the detector on the training images and chose those detections that overlapped
with the groundtruth by more than 0.5 in the intersection over union measure. For each positive
detection we also recorded the winning component. We compute the mask for each component by
simply averaging the groundtruth class regions inside the assigned groundtruth boxes. Prior
to averaging, all bounding boxes were warped to the same size, i.e., the size of the root filter
of the component. To get the shape mask for each detection we warped the average mask
of the detected component to the predicted bounding box.
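The mask construction described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: boxes are `(x0, y0, x1, y1)` tuples with an exclusive maximum, masks are nested lists, and warping uses nearest-neighbor sampling; the function names are hypothetical and the interpolation actually used in the experiments is not specified in the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x0, y0, x1, y1), exclusive max."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def warp_nearest(mask, out_h, out_w):
    """Resize a 2D mask to (out_h, out_w) with nearest-neighbor sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def average_mask(masks, out_h, out_w):
    """Warp each groundtruth mask to a common size and average them."""
    warped = [warp_nearest(m, out_h, out_w) for m in masks]
    n = float(len(warped))
    return [[sum(w[r][c] for w in warped) / n for c in range(out_w)]
            for r in range(out_h)]
```

A detection would be kept as positive when `iou(detection, groundtruth) > 0.5`, and the soft mask for a new detection is obtained by calling `warp_nearest` on the component's average mask with the height and width of the predicted box.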

MSRC-21 contains classes such as sky and water, as well as more shape-defined classes such
as cow and car. We manually annotated bounding boxes for the latter classes, with a total
of 15 classes and 934 annotations. We also annotated 21 scenes, taking the label of the
salient object in the image, if there is one, or a more general label such as "city" or "water"
otherwise. We follow the standard error measures of average per-class accuracy as well as
average per-pixel accuracy, denoted as global Ladicky et al. (2010b). We used the standard
train/test split Shotton et al. (2008) to train the full model, the pixel unary potential, object
detector and scene classifier.
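The two accuracy measures can be made concrete with a short Python sketch. It assumes a pixel-level confusion matrix as input, which is a hypothetical format chosen for illustration; `confusion[t][p]` counts the pixels of true class `t` that were predicted as class `p`.

```python
def segmentation_accuracies(confusion):
    """Return (average per-class accuracy, global per-pixel accuracy)."""
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(num_classes))
    global_acc = correct / total
    # Classes without any ground-truth pixels are skipped in the average.
    per_class = [confusion[c][c] / sum(confusion[c])
                 for c in range(num_classes) if sum(confusion[c]) > 0]
    average_acc = sum(per_class) / len(per_class)
    return average_acc, global_acc
```

The average measure weights every class equally, so it penalizes errors on rare classes more heavily than the global measure, which is dominated by frequent classes such as sky or grass.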
Table 2: MSRC-21 segmentation results



Table 2 reports the segmentation accuracy, along with the comparisons with the existing
state-of-the-art. Our joint model achieves the highest average accuracy reported on this
dataset to date. Furthermore, the joint model not only improves segmentation accuracy
but also significantly boosts object detection and scene classification. Scene classification
improves from 79.5% to 80.6%, while detection improves from 44.6% to 50.7% recall at
equal false positive rate. The average precision of the detector also improves from 48.2%
to 49.3%. This is notable as context re-scoring Felzenszwalb et al. (2010) fails and reduces
performance to 45.7%. We conjecture that this is due to the small number of training
examples. Fig. 7 shows some good segmentation examples, as well as some failure modes,
which are due to very bad unary segmentation potentials or occur when several tasks agree on
the wrong class.

Figure 7: Segmentation examples: (image, groundtruth, our holistic scene model). Panels:
(Successful cases) and (Failure modes).



6.4 3D indoor scene understanding 

Most existing approaches to recovering the spatial layout of indoor scenes rely on the
Manhattan world assumption, which states that there exist three dominant vanishing points
which are orthogonal. They typically formulate the problem as a structured prediction
task, which estimates the 3D box that best approximates the scene layout Hedau et al.
(2009); Lee et al. (2010a); Wang et al. (2010). Two different parameterizations have been
proposed for this problem, both assuming that the three dominant vanishing points can
be reliably detected. In Hedau et al. (2009); Lee et al. (2010a), candidate 3D boxes are
generated, and inference is formulated in terms of a single high dimensional discrete random
variable. Hence, one state of such a variable denotes one candidate 3D layout. This limits
significantly the number of candidate boxes, e.g., only ≈ 1000 candidates are employed
in Hedau et al. (2009). In contrast, Wang et al. (2010) parameterize the layout with four
discrete random variables that correspond to the angles encoding the rays that originate
from the respective vanishing points. An illustration of this parameterization is shown in
Fig. 8 (left).

Existing approaches employ potentials based on different image information. Geometric
context Hoiem et al. (2007), orientation maps Lee et al. (2009) as well as lines in
accordance with vanishing points Wang et al. (2010) are amongst the most successful cues. The
complexity of learning and inference is determined by the order of the potentials - the
number of variables involved and their size - that encode the image features. These potentials
are typically unary, pairwise as well as higher-order (i.e., order four), and count for each
face the number of pixels labeled with a particular label. The order is even higher when
reasoning about clutter in the form of hidden variables Wang et al. (2010) (i.e., order five)
or objects present in the scene that restrict the hypothesis space Lee et al. (2010a). While
the aforementioned approaches perform well in practice, to tractably handle learning and
inference with both parameterizations, reductions on the search space were proposed and/or
a limited number of labelings was considered.

Figure 8: Layout estimation: (Left) Parameterization of the problem. (Right) Comparison
to the state-of-the-art that uses the same image information on the layout
data set of Hedau et al. (2009). Pixel classification error is given in %.



In contrast, in recent work Schwing et al. (2012) we have proposed a novel and efficient
approach to discriminatively predict the 3D layout of indoor scenes. In particular, we
generalize the concept of integral images to "integral geometry," by constructing accumulators
in accordance with the vanishing points. We showed that utilizing this concept, as all
potentials represent counts, their computation can be reduced to sums of pairwise potentials.
As a result, learning and inference are possible without further reduction of the search space.
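To give some intuition for the accumulator idea, the following Python sketch shows the classical integral image for axis-aligned rectangles, which is the concept that "integral geometry" generalizes. It is not the construction of Schwing et al. (2012): their accumulators are aligned with rays from the vanishing points, whereas this example uses image rows and columns, and the function names are illustrative only.

```python
def integral_image(counts):
    """Accumulator acc[r][c] = sum of counts[0:r][0:c] (one extra row/column)."""
    h, w = len(counts), len(counts[0])
    acc = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            acc[r + 1][c + 1] = (counts[r][c] + acc[r][c + 1]
                                 + acc[r + 1][c] - acc[r][c])
    return acc


def box_sum(acc, r0, c0, r1, c1):
    """Sum of counts over rows r0..r1-1 and columns c0..c1-1 in four lookups."""
    return acc[r1][c1] - acc[r0][c1] - acc[r1][c0] + acc[r0][c0]
```

Once the accumulator is built, the count inside any rectangle costs a constant number of lookups, each depending on only two coordinates. This mirrors how count-based potentials over a layout face can be rewritten as sums of terms that each involve a pair of variables.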

We evaluated our approach on the data set of Hedau et al. (2009), which contains 314
images with ground truth annotation of layout faces. We employed the vanishing point
detection of Hedau et al. (2009), which failed in 9 training images and was successful for all
test images, 105 in total. We use a pixel based error measure, counting the percentage of
pixels that disagree with the provided ground truth labeling. We compare our approach to
the state-of-the-art. Similar to Lee et al. (2010a), we report results when using different
sets of image features, i.e., orientation maps (OM), geometric context (GC), and both
(OM+GC). We denote by Hedau et al. (2009) (a) the setting in which the GC features are
used to estimate the layout, and by Hedau et al. (2009) (b) the setting in which the layout is
used to re-estimate the GC features, and these new features are used to improve the layout.
As shown in Fig. 8 (right), our approach is able to significantly outperform the state-of-the-art
in all scenarios, with our smallest error rate when using all features being 13.59%. We improve
the state-of-the-art by 3.6% for the OM features, by 5.8% for the GC features and by 5.0% when
combining both feature cues. Importantly, inference is very efficient and takes on average
0.15 seconds per image. Fig. 9 depicts some successful examples, as well as failure modes.
We refer the reader to Schwing et al. (2012) for more details and results.

Figure 9: Original image with estimated layout in red, (OM) and (GC) features. (a)-(d):
Examples of successful cases, with errors of 1.25%, 1.48%, 1.50% and 1.60%. (e)-(f): Failure
modes, with errors of 48.56% and 42.11%.



6.5 Shape reconstruction from monocular imaging 

Existing approaches to tackling monocular non-rigid surface reconstruction can be classified
into (i) non-rigid structure-from-motion techniques Bregler et al. (2000); Xiao and Kanade
(2005); Fayad et al. (2010) that exploit the availability of multiple images of different
deformations to reconstruct both 3D points and camera motion, and (ii) template-based methods
Shen et al. (2010); Perriollat et al. (2011); Brunet et al. (2011) that rely on a reference
image with known 3D shape to perform reconstruction from a single additional image of the
deformed surface. In most cases, the aforementioned methods are specifically designed to
handle feature point correspondences, and as a consequence, cannot make use of richer
image information, such as full surface texture, or surface boundaries. More importantly,
these methods become unsuitable when too few feature points can be reliably detected
and matched. Several attempts have been made to leverage more complex image
likelihoods Salzmann et al. (2008); Salzmann and Urtasun (2010). However, the resulting
methods rely on gradient-based optimization schemes that can easily get trapped in the many
local maxima of these complex, non-smooth likelihoods. As a consequence, these methods
have only been used either for frame-to-frame tracking, where the previous frame provides
a good initialization Salzmann et al. (2008), or when large amounts of training data are
available to learn a discriminative predictor that produces a good initialization Salzmann
and Urtasun (2010).



In recent work, Salzmann and Urtasun (2012), we have proposed to frame the problem
as one of inference in a graphical model. As this optimization is more global than
gradient-based methods, it is also more robust to local maxima, thus yielding accurate
reconstructions even in the absence of a good initialization. More specifically, we represent a
surface as a triangulated mesh, and define the random variables in the graphical model to
be the rotations and translations of the individual mesh facets. To handle such continuous
variables, we adopt particle convex belief propagation Peng et al. (2011) as our inference
algorithm: We iteratively draw random samples around the current solution for each
variable, compute the MAP estimate of the discrete graphical model defined by these samples
using convex belief propagation Hazan and Shashua (2010), and update the current solution
with this MAP estimate. This strategy lets us effectively explore the 3D shape space even
when no good initialization is provided. We define potentials that encode surface boundary,
facet coherence as well as template matching. We refer the reader to Salzmann and
Urtasun (2012) for more details and results. We employ our approach to find the weights of
the individual terms in the likelihood. To speed up inference, we first define the graphical
model over a coarse mesh and perform gradient descent on a finer mesh, using the MAP of the
coarse mesh as the initial point.
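The sample, solve, update loop described above can be sketched on a toy problem. The Python example below is a simplified illustration and not the method of Peng et al. (2011): it uses a one-dimensional chain with a hypothetical quadratic energy, and it replaces convex belief propagation with exact dynamic programming, which is possible only because the toy model is a chain. All function and parameter names are assumptions made for this example.

```python
import random


def energy(x, targets, lam):
    """Toy energy: quadratic data terms plus quadratic smoothness on a chain."""
    unary = sum((xi - ti) ** 2 for xi, ti in zip(x, targets))
    pair = sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))
    return unary + lam * pair


def chain_map(cands, targets, lam):
    """Exact MAP over per-variable candidate sets via dynamic programming."""
    cost = [[(c - targets[0]) ** 2 for c in cands[0]]]
    back = []
    for i in range(1, len(cands)):
        row, ptr = [], []
        for c in cands[i]:
            trans = [cost[-1][k] + lam * (c - p) ** 2
                     for k, p in enumerate(cands[i - 1])]
            best = min(range(len(trans)), key=trans.__getitem__)
            row.append(trans[best] + (c - targets[i]) ** 2)
            ptr.append(best)
        cost.append(row)
        back.append(ptr)
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    solution = [cands[-1][k]]
    for i in range(len(cands) - 1, 0, -1):
        k = back[i - 1][k]
        solution.append(cands[i - 1][k])
    return solution[::-1]


def particle_map(targets, lam, iters, num_samples, sigma, seed):
    """Iteratively sample around the current solution and re-solve the MAP."""
    rng = random.Random(seed)
    x = [0.0] * len(targets)
    history = [energy(x, targets, lam)]
    for _ in range(iters):
        # The current value is kept as a candidate, so the energy cannot increase.
        cands = [[xi] + [rng.gauss(xi, sigma) for _ in range(num_samples)]
                 for xi in x]
        x = chain_map(cands, targets, lam)
        history.append(energy(x, targets, lam))
    return x, history
```

Keeping the current solution among the candidates is the design choice that makes each iteration monotone; on loopy models such as a mesh, the exact chain solver would have to be replaced by an approximate MAP routine such as convex belief propagation.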

We compare our results against two baselines. The first one, later denoted by Shen09,
corresponds to Shen et al. (2010) initialized with the reference shape, with the extension
of Salzmann and Urtasun (2010) to allow for more general image likelihoods than feature
point reprojection error. The second baseline, later denoted by Salz10, follows the method
of Salzmann and Urtasun (2010) and uses a Gaussian process (GP) predictor to initialize
the shape before gradient-based optimization. To learn the GP predictor, we used the same
training shapes as to learn the potential weights, and employed either noisy 2D point locations
or PHOG descriptors as input. To confirm that a simple coarse-to-fine optimization scheme
is not enough to solve the problem, we also compare our results with a coarse-to-fine version
of Shen et al. (2010), denoted by Shen09 CTF. For all the baselines, we used the same image
likelihoods as for our method, together with the weights learned with our CRF formulation.

We perform experiments using data obtained with a motion capture system (DeformData).
The data consists of 3D reconstructions of reflective markers placed in a 9 x 9 regular
grid of 160 x 160 mm on a piece of cardboard deformed in front of 6 infrared cameras.
Since no images are provided with the 3D data, we synthesized well- and poorly-textured
images as before. We used 5 training examples to learn the potential weights. Fig. 10
depicts the 3D errors with a coarse mesh and after refinement using a gradient descent
approach. Our approach yields much more accurate reconstructions than the baselines.
Interestingly, we also outperform the baselines after refinement. This shows that our
coarse results still provide a better initialization than the coarse version of Shen09. Note
that with this poorly-textured surface, smoothness improves reconstruction, which seems
natural since image information is much weaker. This, however, is not noticeably the case
for the baselines.

Finally, to show that our approach can also be applied to real images, we used two
sequences of different deforming materials (DeformData). While these are video sequences, all
the images were treated independently and initialized from the template mesh to illustrate
the fact that our approach can perform reconstruction from a single input image. Since no
training data is available for these surfaces, we used a single training example consisting
of the template mesh with reference image to learn the potential weights. In Fig. 11, we
visually compare our reconstructions to those of Shen09. We do not show the results of
Salz10, since with the template mesh as single training example, it would always predict
the reference shape, and thus perform the same as Shen09. For the well-textured surface,
Shen09 manages to reconstruct fairly large deformations. However, as illustrated by the
two leftmost columns of the figure for two very similar frames, it is less consistent than our
approach. For the poorly-textured surface, the baseline is completely unable to cope with
large deformations. Our approach, however, still manages to reconstruct the surface. In
the rightmost column of the figure, we show a failure case of our approach, where the facet
orientation is ambiguous. Furthermore, the topology of the coarse mesh makes it harder to
bend the surface along this diagonal. Note, however, that as opposed to the baseline, we
still recover some degree of surface deformation.

Figure 10: Reconstructing a piece of cardboard from well-textured images. 3D
error when (a) using a coarse (3x3) mesh and no smoothness, and (b) refining the
results of (a) with a gradient-based method. Shen09 and Salz10 were directly
obtained using a fine mesh. Note that our coarse results give a much better
initialization for the refinement step.



7. Related work 



In this work we learn the parameters of region based structured predictors using pseudo
moment matching and entropy approximations, or equivalently a low-dimensional extended
log-loss. We also construct an efficient message-passing algorithm and show how it achieves
state-of-the-art results in stereo estimation (Yamaguchi et al. (2012)), semantic segmentation
(Yao et al. (2012)), shape reconstruction (Salzmann and Urtasun (2012)), and indoor
scene understanding (Schwing et al. (2012)). This work extends the framework of Hazan
and Urtasun (2010) while simplifying its theoretical and practical concepts. Theoretically,
it extends Hazan and Urtasun (2010) to general region graphs, introduces the notion of
extended log-loss and investigates the penalty method in message-passing. Practically, it
emphasizes the importance of graph based predictors and shows how to use them to achieve
state-of-the-art results in several computer vision applications.



The extended log-loss reduces to the hinge-loss, described by Taskar et al. (2004);
Tsochantaridis et al. (2006), and the log-loss of Sha and Saul (2007); Gimpel and Smith
(2010). This extension is implied by Hazan and Urtasun (2010); Pletscher et al. (2010),
who unify the log-loss and the hinge-loss using a temperature parameter in the logarithm
of the partition function. This extension, usually referred to as a soft-max (cf. Vontobel and
Koetter (2006); Johnson et al. (2007); Hazan and Shashua (2010)), presents our extended
log-loss using a linear term and a soft-max.

Figure 11: Reconstructing surfaces from real images. From top to bottom: Our
reconstructions reprojected on the original images, side view of our reconstructions,
reconstructions obtained with Shen09 CTF reprojected on the original
images, side view of those reconstructions. For a well-textured surface, the
baseline manages to reconstruct fairly large deformations, but is less consistent
than our approach, as illustrated for two very similar frames. For a poorly-textured
surface, the baseline only manages to reconstruct small deformations,
whereas our approach can deal with much larger ones. The rightmost column
shows a failure of our method due to an ambiguity in the facet reconstruction
and to the use of a coarse mesh.



The message-passing techniques we are using in this work rely on the region graph
message-passing of Heskes (2006) and the region norm-product algorithm of Hazan et al.
(2012). Section 2.2 describes how to use these algorithms in a black-box manner and their
computational disadvantages. The main purpose of this work is to integrate these methods
to efficiently learn the parameters of structured predictors in general graphical models.

The extended log-loss minimization, as appears in Corollary 2, reduces to CRFs when
setting $\epsilon = 1$. CRFs, defined by Lafferty et al. (2001); Lebanon and Lafferty (2002), are
widely applied in machine learning. Whenever the labels are in a discrete product space, i.e.,
$y = (y_1, \dots, y_n)$, the gradient is exponentially hard to compute. In this case, approximate
inference techniques can be used to estimate the gradient (e.g., Levin and Weiss (2006);
Yanover et al. (2007)). This approach, as described in Section 2.2, requires running an
approximate inference algorithm for every gradient step, thus it is computationally intractable
in general. To apply CRFs efficiently in discrete product spaces, some works focus on the
practical aspects of low dimensional approximations for learning CRF parameters. Sutton
and McCallum (2009) present the piecewise training approach which uses a low dimensional
log-loss while ignoring the consistency messages $\lambda$. In our setting, these messages are used
to enforce the marginalization constraints of the dual program. Ganapathi et al. (2008)
approximate CRFs using the non-concave Bethe entropy and execute a double loop
algorithm while showing that it requires only a few outer-loop iterations. The inner loops of this
algorithm use concave entropy approximations, and are computed using BFGS. Our work
is different as it considers efficient algorithms for parameter learning with concave entropy
approximations that take advantage of the graphical model by sending messages along its
edges. We show in the experiments that this significantly improves the run-time of the
algorithm. Other works focus on the theoretical aspects of fractional entropy approximations.
Wainwright (2006) proves that whenever one wishes to learn parameters jointly with
prediction, it is preferable to choose concave entropy approximations since their parameters
are stable with respect to their prediction. The theoretical foundations of concave entropy
approximations in parameter learning and pseudo moment matching appear in Wainwright
et al. (2003) and inspire our work. In particular, our work provides a detailed derivation
of the primal and dual programs for graph based features, as well as efficient
message-passing algorithms which we use to attain state-of-the-art results in several computer vision
applications.

The extended log-loss minimization, as appears in Corollary 2, reduces to structured
SVMs for $\epsilon = 0$. Structured SVMs are defined in Taskar et al. (2004); Tsochantaridis et al.
(2004) and are motivated by the structured perceptron of Collins (2002). Structured SVMs
are very popular in machine learning, but in general the gradient computation requires
solving the max-function over a discrete product space, thus it is NP-hard to compute.
Therefore, in the general setting, straightforward subgradient methods, such as Collins (2002);
Roth and Yih (2005); Ratliff et al. (2007); Shalev-Shwartz et al. (2007), cannot be applied
as they require solving an NP-hard problem for each gradient step. One can use a low
dimensional max-function, while ignoring the consistency messages $\lambda$ (e.g., Punyakanok et al.
(2005)) which we use to enforce the marginalization constraints of the dual program.
Alternatively, one can relax the NP-hard max-function in a similar manner to our dual program,
introducing beliefs and marginalization constraints. For $\epsilon = 0$ this approach boils down to
running a linear program solver for every gradient step, e.g., Kulesza et al. (2007); Finley
and Joachims (2008), thus it is computationally intractable in general. To relax the
computational burden of the max-function, Tsochantaridis et al. (2006) use the cutting plane
method. It was shown that the number of added constraints is polynomial (e.g., Joachims
et al. (2009)), but in practice finding a cutting plane may be hard and the number of added
constraints may be large. Taskar and collaborators use a different approach to deal with
the computational complexity of the max-function. Specifically, they consider structured
SVMs which can be solved efficiently using Lagrange multipliers and duality. Taskar et al.
(2004) introduce the structured SMO to solve the structured SVM dual. In our work we
avoid solving the dual program since it is a constrained optimization problem. In contrast,
our primal program is unconstrained, thus it can usually be solved faster using block
coordinate steps, and can be distributed and parallelized easily, as described in Schwing et al.
(2011). Taskar et al. (2005); Anguelov et al. (2005) consider the hinge loss conjugate dual
and integrate it into the primal program, thus effectively replacing the max-function of
the hinge-loss with a min-function. Although these formulations are applied to settings for
which the maximum can be computed efficiently (e.g., associative networks and matchings),
these dual-primal concepts play an important role in our derivation of the low dimensional
primal program. Meshi et al. (2010) further improve this idea and their primal program as
well as their algorithm are similar to our primal program and message-passing algorithm,
when restricting to $\epsilon = 0$. However, their setting is restricted to pairwise graphical
models while we consider graph-based features for general regions, which are important when
applying these methods to real-life problems. In general, when restricting to $\epsilon = 0$, one
is bound to update the parameters using subgradient steps. Despite the theoretical
guarantees of subgradient methods, these methods tend to be slow in many cases of interest
and for different settings of C, e.g., Shalev-Shwartz et al. (2007). Furthermore, restricting
to $\epsilon = 0$, it is hard to verify that the subgradient algorithm reaches an optimal solution: since
the max-function is non-smooth, it is hard to recover a dual feasible solution to verify a
small duality gap. In general, we find that setting $\epsilon > 0$ gives better results faster than
setting $\epsilon = 0$, and it might be better to solve structured SVMs without the shortcomings of
subgradient methods by simply setting $\epsilon \approx 0$.



8. Conclusion and Discussion 

In this paper we have related CRFs and structured SVMs through the extended log-loss
formulation, thus showing how CRFs smoothly approximate structured SVMs. We have also
proposed a low dimensional loss formulation which decomposes according to general regions
in a graphical model, and whose dual program corresponds to pseudo moment matching and
fractional entropy approximations. We have derived an efficient message-passing algorithm
for learning the parameters of graph based structured predictors and have demonstrated the
effectiveness of our approach, achieving state-of-the-art results in several computer vision
applications. We believe it would be interesting to investigate in the future whether this algorithm
provides state-of-the-art performance in domains other than computer vision, or whether the
statistics in computer vision are used by this approach in a special manner.

The computational complexity of our algorithm depends on norm-products over the
labels of regions. Therefore, efficient inference techniques over large regions can be
applied as sub-procedures in our algorithm, e.g., Kohli et al. (2009); Batra et al. (2010);
Tarlow et al. (2010, 2011, 2012).

The extended log-loss introduces a weight parameter $\epsilon$ which controls the characteristics
of the loss, e.g., for $\epsilon = 0$ we recover the hinge loss for structured SVMs and for $\epsilon = 1$ we
recover the log-loss for CRFs. The learning program also considers a constant C which
controls the tradeoff between the extended log-loss and the regularization. We have shown
that whenever the true loss is identically zero, these two parameters influence equally
the learned parameters, and an important open problem is their influence in general loss
settings.
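The role of the temperature can be checked numerically with a small Python sketch. It assumes that the extended log-loss has the form $\epsilon \ln \sum_{\hat y} \exp((\ell(y, \hat y) + w^\top \Phi(x, \hat y)) / \epsilon) - w^\top \Phi(x, y)$, i.e., a soft-max plus a linear term; this form is our reading of the description in the text, and the function and argument names are illustrative only.

```python
import math


def soft_max(scores, eps):
    """eps * log(sum(exp(s / eps))), computed stably; max(scores) when eps == 0."""
    m = max(scores)
    if eps == 0:
        return m
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))


def extended_log_loss(scores, task_losses, true_index, eps):
    """Soft-max of the loss-augmented scores minus the score of the true label.

    scores[k] stands for w^T Phi(x, y_k) and task_losses[k] for loss(y, y_k),
    enumerated over a small label set (an assumption made for illustration).
    """
    augmented = [s + l for s, l in zip(scores, task_losses)]
    return soft_max(augmented, eps) - scores[true_index]
```

With `eps=1` the function returns the (loss-augmented) log-loss of a CRF, with `eps=0` the structured hinge loss, and for small positive `eps` the value lies between the hinge loss and the hinge loss plus `eps * log(n)`, where `n` is the number of labels. This bound is one way to see why a small positive temperature behaves like a smooth surrogate of the structured SVM objective.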

In our framework, we can enforce the moment matching constraints through general
concave functions. These functions translate to a regularization in the primal. For
computational efficiency we choose the square function but we did not investigate the different
moment matching and regularization functions. Moreover, we enforce the marginalization
constraints through indicator functions, in order to obtain closed-form solutions in the
primal block coordinate descent. However, we have shown that using the penalty method
we can enforce the marginalization constraints with different convex functions. We leave
the effect of general convex functions on moment matching and regularization, as well as on
marginalization constraints and efficient message-passing, for future research.

Interestingly, our approach confirms that the parameters of graph based structured
predictors can be efficiently learned in many real-life problems. This validates the intuition
behind the theoretical results of Wainwright et al. (2003); Wainwright (2006) which assert
that whenever learning and inference occur together one can use pseudo moment matching
for learning the parameters. This concept was put forward in the general framework of
learning to reason by Khardon and Roth (1997), and we leave to future research the search for
other frameworks with similar learning-prediction robustness in which such algorithms
might be effective.